1. Import packages

In [1]:
# This first set of packages include Pandas, for data manipulation, numpy for mathematical computation and matplotlib & seaborn, for visualisation.
import pandas as pd
import numpy as np
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style='white', context='notebook', palette='deep')
print('Data Manipulation, Mathematical Computation and Visualisation packages imported!')

# Statistical packages used for transformations
from scipy import stats
from scipy.stats import skew, norm
from scipy.special import boxcox1p
from scipy.stats.stats import pearsonr
print('Statistical packages imported!')

# Metrics used for measuring the accuracy and performance of the models
#from sklearn import metrics
#from sklearn.metrics import mean_squared_error
print('Metrics packages imported!')

# Algorithms used for modeling
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.kernel_ridge import KernelRidge
import xgboost as xgb
print('Algorithm packages imported!')

# Pipeline and scaling preprocessing will be used for models that are sensitive
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
print('Pipeline and preprocessing packages imported!')

# Model selection packages used for sampling dataset and optimising parameters
from sklearn import model_selection
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
print('Model selection packages imported!')

# Set visualisation colours
mycols = ["#66c2ff", "#5cd6d6", "#00cc99", "#85e085", "#ffd966", "#ffb366", "#ffb3b3", "#dab3ff", "#c2c2d6"]
sns.set_palette(palette = mycols, n_colors = 4)
print('My colours are ready! :)')

# To ignore annoying warning
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)
warnings.filterwarnings("ignore", category=DeprecationWarning)
print('Deprecation warning will be ignored!')
Data Manipulation, Mathematical Computation and Visualisation packages imported!
Statistical packages imported!
Metrics packages imported!
Algorithm packages imported!
Pipeline and preprocessing packages imported!
Model selection packages imported!
My colours are ready! :)
Deprecation warning will be ignored!

2. Load Data

  • The Pandas package helps us work with our datasets. We start by reading the training and test datasets into DataFrames.
  • We want to save the 'Id' columns from both datasets for later use when preparing the submission data.
  • But we can drop them from the training and test datasets as they are redundant.
In [2]:
train = pd.read_csv('./inputs/train.csv')
test = pd.read_csv('./inputs/test.csv')

# Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']

# Now drop the  'Id' column as it's redundant for modeling
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

print(train.shape)
print(test.shape)
train.head()
(1460, 80)
(1459, 79)
Out[2]:
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 80 columns

3. Data Preparation

3.1 - Remove outliers

  • Outliers will sit way outside of the distribution of data points. Hence, this will skew the distribution of the data and potential calculations.
  • The chart on the left shows the data before removing the outliers, and the chart on the right shows after.
In [3]:
plt.subplots(figsize=(15, 5))

plt.subplot(1, 2, 1)
g = sns.regplot(x=train['GrLivArea'], y=train['SalePrice'], fit_reg=False).set_title("Before")

# Delete outliers
plt.subplot(1, 2, 2)                                                                                
train = train.drop(train[(train['GrLivArea']>4000)].index)
g = sns.regplot(x=train['GrLivArea'], y=train['SalePrice'], fit_reg=False).set_title("After")

3.2 - Treat missing values

If you have missing values, you have two options:

  • Delete the entire row
  • Fill the missing entry with an imputed value

In order to clean this dataset, we will create a dataset of the training and test data through concatenation in order to make changes consistent across both. Then, we will cycle through each feature with missing values and treat them individually based on the data description, or personal judgement.

In [4]:
# First of all, save the length of the training and test data for use later
ntrain = train.shape[0]
ntest = test.shape[0]

# Also save the target value, as we will remove this
y_train = train.SalePrice.values

# concatenate training and test data into all_data
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)

print("all_data shape: {}".format(all_data.shape))
all_data shape: (2915, 79)
In [5]:
# aggregate all null values 
all_data_na = all_data.isnull().sum()

# get rid of all the values with 0 missing values
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
plt.subplots(figsize =(16, 10))
all_data_na.plot(kind='bar');

Through reference of the data description, this gives guidance on how to treat missing values for some columns. For ones where guidance isn't provided, Personal judgement will be applied.

In [6]:
# Using data description, fill these missing values with "None"
for col in ("PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu",
           "GarageType", "GarageFinish", "GarageQual", "GarageCond",
           "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1",
            "BsmtFinType2", "MSSubClass", "MasVnrType"):
    all_data[col] = all_data[col].fillna("None")
print("'None' - treated...")

# The area of the lot out front is likely to be similar to the houses in the local neighbourhood
# Therefore, let's use the median value of the houses in the neighbourhood to fill this feature
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
print("'LotFrontage' - treated...")

# Using data description, fill these missing values with 0 
for col in ("GarageYrBlt", "GarageArea", "GarageCars", "BsmtFinSF1", 
           "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "MasVnrArea",
           "BsmtFullBath", "BsmtHalfBath"):
    all_data[col] = all_data[col].fillna(0)
print("'0' - treated...")


# Fill these features with their mode, the most commonly occuring value. This is okay since there are a low number of missing values for these features
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
all_data["Functional"] = all_data["Functional"].fillna(all_data['Functional'].mode()[0])
print("'mode' - treated...")

all_data_na = all_data.isnull().sum()
print("Features with missing values: ", all_data_na.drop(all_data_na[all_data_na == 0].index))
'None' - treated...
'LotFrontage' - treated...
'0' - treated...
'mode' - treated...
Features with missing values:  Utilities    2
dtype: int64

We see that we have 1 remaining feature with missing values, Utilities. We will further analyse it.

In [7]:
plt.subplots(figsize =(15, 5))

plt.subplot(1, 2, 1)
g = sns.countplot(x = "Utilities", data = train).set_title("Utilities - Training")

plt.subplot(1, 2, 2)
g = sns.countplot(x = "Utilities", data = test).set_title("Utilities - Test")

This tell us that within the training dataset, Utilities has two unique values: "AllPub" and "NoSeWa". With "AllPub" being by far the most common.

  • However, the test dataset has only 1 value for this column, which means that it holds no predictive power because it is a constant for all test observations. Therefore, we can drop this column
In [8]:
# From inspection, we can remove Utilities
all_data = all_data.drop(['Utilities'], axis=1)

all_data_na = all_data.isnull().sum()
print("Features with missing values: ", len(all_data_na.drop(all_data_na[all_data_na == 0].index)))
Features with missing values:  0

4. Exploratory Data Analysis

4.1 Correlation matrix

We will next analyse each feature in more detail since missing values and outliers have already been treated. This will give guidance on how to prepare this feature for modeling. WE will analyse the features based on the different aspects of the house available in the dataset.

In [9]:
import itertools
corr = train.corr()
plt.subplots(figsize=(30, 30))
cmap = sns.diverging_palette(150, 250, as_cmap=True)
sns.heatmap(corr, cmap="RdYlBu", vmax=1, vmin=-0.6, center=0.2, square=True, linewidths=0, cbar_kws={"shrink": .5}, annot = True);
  • Using this correlation matrix, we are able to visualise the highly influencing factors on SalePrice.
  • We will create polynomial/derived features from the highly correlating features, in an attempt to capture the complex non-linear relationships within the data.

4.2 Feature engineering

4.2.1 Polynomials

Not all data has a linear relationship therefore it may be necessary for our model to fit the more complex relationships in the data.

Using the correlation matrix, the top influencing factors that we will use to create polynomials are:

  1. OverallQual
  2. GrLivArea
  3. GarageCars
  4. GarageArea
  5. TotalBsmtSF
  6. 1stFlrSF
  7. FullBath
  8. TotRmsAbvGrd
  9. Fireplaces
  10. MasVnrArea
  11. BsmtFinSF1
  12. LotFrontage
  13. WoodDeckSF
  14. OpenPorchSF
  15. 2ndFlrSF
In [10]:
df = pd.DataFrame([[(i,j),corr.loc[i,j]] for i,j in list(itertools.combinations(corr, 2))],columns=['pairs','corr'])    
print(df.sort_values(by='corr',ascending=False))
                             pairs      corr
600       (GarageCars, GarageArea)  0.886882
441      (GrLivArea, TotRmsAbvGrd)  0.833979
188       (YearBuilt, GarageYrBlt)  0.825192
137       (OverallQual, SalePrice)  0.800858
341        (TotalBsmtSF, 1stFlrSF)  0.800759
455         (GrLivArea, SalePrice)  0.720516
391          (2ndFlrSF, GrLivArea)  0.687430
531   (BedroomAbvGr, TotRmsAbvGrd)  0.679346
267     (BsmtFinSF1, BsmtFullBath)  0.661933
610        (GarageCars, SalePrice)  0.649256
365       (TotalBsmtSF, SalePrice)  0.646584
218    (YearRemodAdd, GarageYrBlt)  0.641445
620        (GarageArea, SalePrice)  0.636964
437          (GrLivArea, FullBath)  0.635161
389          (1stFlrSF, SalePrice)  0.625235
398       (2ndFlrSF, TotRmsAbvGrd)  0.610794
395           (2ndFlrSF, HalfBath)  0.609022
126      (OverallQual, GarageCars)  0.598739
170      (YearBuilt, YearRemodAdd)  0.591906
588      (GarageYrBlt, GarageCars)  0.588347
116       (OverallQual, GrLivArea)  0.583519
106       (OverallQual, YearBuilt)  0.571712
589      (GarageYrBlt, GarageArea)  0.564768
512          (FullBath, SalePrice)  0.559048
127      (OverallQual, GarageArea)  0.554905
107    (OverallQual, YearRemodAdd)  0.550971
498       (FullBath, TotRmsAbvGrd)  0.549625
125     (OverallQual, GarageYrBlt)  0.547320
119        (OverallQual, FullBath)  0.543791
439      (GrLivArea, BedroomAbvGr)  0.540083
..                             ...       ...
1            (MSSubClass, LotArea) -0.142192
372           (1stFlrSF, HalfBath) -0.144373
145        (OverallCond, 1stFlrSF) -0.145613
456   (BsmtFullBath, BsmtHalfBath) -0.146201
663               (MoSold, YrSold) -0.146229
215   (YearRemodAdd, KitchenAbvGr) -0.149288
603    (GarageCars, EnclosedPorch) -0.150590
159      (OverallCond, GarageArea) -0.150679
459   (BsmtFullBath, BedroomAbvGr) -0.152268
185      (YearBuilt, KitchenAbvGr) -0.174481
144     (OverallCond, TotalBsmtSF) -0.176000
392       (2ndFlrSF, BsmtFullBath) -0.178521
264         (BsmtFinSF1, 2ndFlrSF) -0.183358
178      (YearBuilt, LowQualFinSF) -0.183720
122    (OverallQual, KitchenAbvGr) -0.184281
158      (OverallCond, GarageCars) -0.185494
223  (YearRemodAdd, EnclosedPorch) -0.193348
151        (OverallCond, FullBath) -0.194167
288        (BsmtFinSF2, BsmtUnfSF) -0.209286
342        (TotalBsmtSF, 2ndFlrSF) -0.226960
366           (1stFlrSF, 2ndFlrSF) -0.252297
10       (MSSubClass, TotalBsmtSF) -0.255441
11          (MSSubClass, 1stFlrSF) -0.265001
592   (GarageYrBlt, EnclosedPorch) -0.296517
157     (OverallCond, GarageYrBlt) -0.323836
138       (OverallCond, YearBuilt) -0.375691
193     (YearBuilt, EnclosedPorch) -0.386904
0        (MSSubClass, LotFrontage) -0.408655
320      (BsmtUnfSF, BsmtFullBath) -0.424026
261        (BsmtFinSF1, BsmtUnfSF) -0.526140

[666 rows x 2 columns]
In [11]:
# Quadratic
all_data["OverallQual-2"] = all_data["OverallQual"] ** 2
all_data["GrLivArea-2"] = all_data["GrLivArea"] ** 2
all_data["GarageCars-2"] = all_data["GarageCars"] ** 2
all_data["GarageArea-2"] = all_data["GarageArea"] ** 2
all_data["TotalBsmtSF-2"] = all_data["TotalBsmtSF"] ** 2
all_data["1stFlrSF-2"] = all_data["1stFlrSF"] ** 2
all_data["FullBath-2"] = all_data["FullBath"] ** 2
all_data["TotRmsAbvGrd-2"] = all_data["TotRmsAbvGrd"] ** 2
all_data["Fireplaces-2"] = all_data["Fireplaces"] ** 2
all_data["MasVnrArea-2"] = all_data["MasVnrArea"] ** 2
all_data["BsmtFinSF1-2"] = all_data["BsmtFinSF1"] ** 2
all_data["LotFrontage-2"] = all_data["LotFrontage"] ** 2
all_data["WoodDeckSF-2"] = all_data["WoodDeckSF"] ** 2
all_data["OpenPorchSF-2"] = all_data["OpenPorchSF"] ** 2
all_data["2ndFlrSF-2"] = all_data["2ndFlrSF"] ** 2
print("Quadratics done!...")

# Cubic
all_data["OverallQual-3"] = all_data["OverallQual"] ** 3
all_data["GrLivArea-3"] = all_data["GrLivArea"] ** 3
all_data["GarageCars-3"] = all_data["GarageCars"] ** 3
all_data["GarageArea-3"] = all_data["GarageArea"] ** 3
all_data["TotalBsmtSF-3"] = all_data["TotalBsmtSF"] ** 3
all_data["1stFlrSF-3"] = all_data["1stFlrSF"] ** 3
all_data["FullBath-3"] = all_data["FullBath"] ** 3
all_data["TotRmsAbvGrd-3"] = all_data["TotRmsAbvGrd"] ** 3
all_data["Fireplaces-3"] = all_data["Fireplaces"] ** 3
all_data["MasVnrArea-3"] = all_data["MasVnrArea"] ** 3
all_data["BsmtFinSF1-3"] = all_data["BsmtFinSF1"] ** 3
all_data["LotFrontage-3"] = all_data["LotFrontage"] ** 3
all_data["WoodDeckSF-3"] = all_data["WoodDeckSF"] ** 3
all_data["OpenPorchSF-3"] = all_data["OpenPorchSF"] ** 3
all_data["2ndFlrSF-3"] = all_data["2ndFlrSF"] ** 3
print("Cubics done!...")

# Square Root
all_data["OverallQual-Sq"] = np.sqrt(all_data["OverallQual"])
all_data["GrLivArea-Sq"] = np.sqrt(all_data["GrLivArea"])
all_data["GarageCars-Sq"] = np.sqrt(all_data["GarageCars"])
all_data["GarageArea-Sq"] = np.sqrt(all_data["GarageArea"])
all_data["TotalBsmtSF-Sq"] = np.sqrt(all_data["TotalBsmtSF"])
all_data["1stFlrSF-Sq"] = np.sqrt(all_data["1stFlrSF"])
all_data["FullBath-Sq"] = np.sqrt(all_data["FullBath"])
all_data["TotRmsAbvGrd-Sq"] = np.sqrt(all_data["TotRmsAbvGrd"])
all_data["Fireplaces-Sq"] = np.sqrt(all_data["Fireplaces"])
all_data["MasVnrArea-Sq"] = np.sqrt(all_data["MasVnrArea"])
all_data["BsmtFinSF1-Sq"] = np.sqrt(all_data["BsmtFinSF1"])
all_data["LotFrontage-Sq"] = np.sqrt(all_data["LotFrontage"])
all_data["WoodDeckSF-Sq"] = np.sqrt(all_data["WoodDeckSF"])
all_data["OpenPorchSF-Sq"] = np.sqrt(all_data["OpenPorchSF"])
all_data["2ndFlrSF-Sq"] = np.sqrt(all_data["2ndFlrSF"])
print("Roots done!...")
Quadratics done!...
Cubics done!...
Roots done!...

4.2.2 Interior

BsmtQual

  • Evaluates the height of the basement.
In [12]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="BsmtQual", y="SalePrice", data=train, order=['Fa', 'TA', 'Gd', 'Ex']);

plt.subplot(1, 3, 2)
sns.stripplot(x="BsmtQual", y="SalePrice", data=train, size = 5, jitter = True, order=['Fa', 'TA', 'Gd', 'Ex']);

plt.subplot(1, 3, 3)
sns.barplot(x="BsmtQual", y="SalePrice", data=train, order=['Fa', 'TA', 'Gd', 'Ex']);
  • SalePrice is clearly affected by BsmtQual, with the better the quality being meaning the higher the price.
  • However, it looks as though most houses have either 'Good' or 'Typical' sized basements.
  • Since this feature is ordinal, i.e. the categories represent different levels of order, we will replace the values manually
In [13]:
all_data['BsmtQual'] = all_data['BsmtQual'].map({"None":0, "Fa":1, "TA":2, "Gd":3, "Ex":4})
all_data['BsmtQual'].unique()
Out[13]:
array([3, 2, 4, 0, 1], dtype=int64)

BsmtCond

  • Evaluates the general condition of the basement.
In [14]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="BsmtCond", y="SalePrice", data=train, order=['Po', 'Fa', 'TA', 'Gd']);

plt.subplot(1, 3, 2)
sns.stripplot(x="BsmtCond", y="SalePrice", data=train, size = 5, jitter = True, order=['Po', 'Fa', 'TA', 'Gd']);

plt.subplot(1, 3, 3)
sns.barplot(x="BsmtCond", y="SalePrice", data=train, order=['Po', 'Fa', 'TA', 'Gd']);
  • As the condition of the basement improves, the SalePrice also increases.
  • However, we see some very high SalePrice values for the houses with "Typical" basement conditions. This perhaps suggests that although these two features correlate positively, BsmtCond may not have a largely influential contribution on SalePrice.
  • We also see the largest number of houses falling into the "TA" category.
  • Since this feature is ordinal, We will replace the values manually
In [15]:
all_data['BsmtCond'] = all_data['BsmtCond'].map({"None":0, "Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5})
all_data['BsmtCond'].unique()
Out[15]:
array([3, 4, 0, 2, 1], dtype=int64)

BsmtExposure

  • Refers to walkout or garden level walls
In [16]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="BsmtExposure", y="SalePrice", data=train, order=['No', 'Mn', 'Av', 'Gd']);

plt.subplot(1, 3, 2)
sns.stripplot(x="BsmtExposure", y="SalePrice", data=train, size = 5, jitter = True, order=['No', 'Mn', 'Av', 'Gd']);

plt.subplot(1, 3, 3)
sns.barplot(x="BsmtExposure", y="SalePrice", data=train, order=['No', 'Mn', 'Av', 'Gd']);
  • As the amount of exposure increases, so does hte typical SalePrice. Interestingly, the average difference of SalePrice between categories is quite low here, telling us that some houses sold for very high prices, even with no exposure.
  • From this analysis we would say that it is positively correlating with SalePrice, but it isn't massively influential.
  • Since this feature is ordinal, we will replace the values manually.
In [17]:
all_data['BsmtExposure'] = all_data['BsmtExposure'].map({"None":0, "No":1, "Mn":2, "Av":3, "Gd":4})
all_data['BsmtExposure'].unique()
Out[17]:
array([1, 4, 2, 3, 0], dtype=int64)

BsmtFinType1

  • Rating of basement finished area
In [18]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="BsmtFinType1", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"], palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="BsmtFinType1", y="SalePrice", data=train, size = 5, jitter = True, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"], palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="BsmtFinType1", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"], palette = mycols);
  • This is very interesting, it seems as though houses with an unfinished basement on average sold for more money than houses having up to an average rating...
  • However, houses with a good finish within the basement still demand more money than unfinished ones.
  • This is an ordinal feature, however as you can see this order does not necessarily cause a higher SalePrice. By creating an ordinal variable it will suggest that if the order of the feature increases then the target variable would also. We can see that this is not the case. Therefore, we will create dummy variables from this feature.
In [19]:
all_data = pd.get_dummies(all_data, columns = ["BsmtFinType1"], prefix="BsmtFinType1")
all_data.head(3)
Out[19]:
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... WoodDeckSF-Sq OpenPorchSF-Sq 2ndFlrSF-Sq BsmtFinType1_ALQ BsmtFinType1_BLQ BsmtFinType1_GLQ BsmtFinType1_LwQ BsmtFinType1_None BsmtFinType1_Rec BsmtFinType1_Unf
0 856 854 0 None 3 1Fam 3 1 706.0 0.0 ... 0.000000 7.810250 29.223278 0 0 1 0 0 0 0
1 1262 0 0 None 3 1Fam 3 4 978.0 0.0 ... 17.262677 0.000000 0.000000 1 0 0 0 0 0 0
2 920 866 0 None 3 1Fam 3 2 486.0 0.0 ... 0.000000 6.480741 29.427878 0 0 1 0 0 0 0

3 rows × 129 columns

BsmtFinSF1

  • Type 1 finished square feet
In [20]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['BsmtFinSF1'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['BsmtFinSF1'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="BsmtFinSF1", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="BsmtFinSF1", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="BsmtFinSF1", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="BsmtFinSF1", data=train, palette = mycols);
  • This feature has a positive correlation with SalePrice and the spread of data points is quite large.
  • It is also clear that the local area (Neighborhood) and style of building (BldgType, HouseStyle and LotShape) has a varying effect on this feature.
  • Since this is a continuous numeric feature, we will bin this into several categories and create dummy features.
In [21]:
all_data['BsmtFinSF1_Band'] = pd.cut(all_data['BsmtFinSF1'], 4)
all_data['BsmtFinSF1_Band'].unique()
Out[21]:
[(-4.01, 1002.5], (1002.5, 2005.0], (2005.0, 3007.5], (3007.5, 4010.0]]
Categories (4, interval[float64]): [(-4.01, 1002.5] < (1002.5, 2005.0] < (2005.0, 3007.5] < (3007.5, 4010.0]]
In [22]:
all_data.loc[all_data['BsmtFinSF1']<=1002.5, 'BsmtFinSF1'] = 1
all_data.loc[(all_data['BsmtFinSF1']>1002.5) & (all_data['BsmtFinSF1']<=2005), 'BsmtFinSF1'] = 2
all_data.loc[(all_data['BsmtFinSF1']>2005) & (all_data['BsmtFinSF1']<=3007.5), 'BsmtFinSF1'] = 3
all_data.loc[all_data['BsmtFinSF1']>3007.5, 'BsmtFinSF1'] = 4
all_data['BsmtFinSF1'] = all_data['BsmtFinSF1'].astype(int)

all_data.drop('BsmtFinSF1_Band', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["BsmtFinSF1"], prefix="BsmtFinSF1")
all_data.head(3)
Out[22]:
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF2 BsmtFinType2 ... BsmtFinType1_BLQ BsmtFinType1_GLQ BsmtFinType1_LwQ BsmtFinType1_None BsmtFinType1_Rec BsmtFinType1_Unf BsmtFinSF1_1 BsmtFinSF1_2 BsmtFinSF1_3 BsmtFinSF1_4
0 856 854 0 None 3 1Fam 3 1 0.0 Unf ... 0 1 0 0 0 0 1 0 0 0
1 1262 0 0 None 3 1Fam 3 4 0.0 Unf ... 0 0 0 0 0 0 1 0 0 0
2 920 866 0 None 3 1Fam 3 2 0.0 Unf ... 0 1 0 0 0 0 1 0 0 0

3 rows × 132 columns

BsmtFinType2

  • Rating of basement finished area (if multiple types)
In [23]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="BsmtFinType2", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"], palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="BsmtFinType2", y="SalePrice", data=train, size = 5, jitter = True, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"], palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="BsmtFinType2", y="SalePrice", data=train, order=["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"], palette = mycols);
  • There seems as though there are a lot of houses with unfinished second basements, and this may cause the skew in terms of SalePrice's being relatively high for these...
  • There also looks to be only a few values for each of the other categories, with the highest average SalePrice coming from the second best category.
  • Although this is intended to be an ordinal feature, we can see that the SalePrice does not necessarily increase with order. Hence, we will cerate dummy variables here.
In [24]:
all_data = pd.get_dummies(all_data, columns = ["BsmtFinType2"], prefix="BsmtFinType2")
all_data.head(3)
Out[24]:
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF2 BsmtFullBath ... BsmtFinSF1_2 BsmtFinSF1_3 BsmtFinSF1_4 BsmtFinType2_ALQ BsmtFinType2_BLQ BsmtFinType2_GLQ BsmtFinType2_LwQ BsmtFinType2_None BsmtFinType2_Rec BsmtFinType2_Unf
0 856 854 0 None 3 1Fam 3 1 0.0 1.0 ... 0 0 0 0 0 0 0 0 0 1
1 1262 0 0 None 3 1Fam 3 4 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 1
2 920 866 0 None 3 1Fam 3 2 0.0 1.0 ... 0 0 0 0 0 0 0 0 0 1

3 rows × 138 columns

BsmtFinSF2

  • Type 2 finished square feet.
In [25]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['BsmtFinSF2'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['BsmtFinSF2'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="BsmtFinSF2", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="BsmtFinSF2", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="BsmtFinSF2", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="BsmtFinSF2", data=train, palette = mycols);
  • There are a large number of data points with this feature = 0. Outside of this, there is no significant correlation with SalePrice and a large spread of values.
  • Hence, we will replace this feature with a flag.
In [26]:
all_data['BsmtFinSf2_Flag'] = all_data['BsmtFinSF2'].map(lambda x:0 if x==0 else 1)
all_data.drop('BsmtFinSF2', axis=1, inplace=True)

BsmtUnfSF

  • Unfinished square feet of basement area
In [27]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['BsmtUnfSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['BsmtUnfSF'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="BsmtUnfSF", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="BsmtUnfSF", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="BsmtUnfSF", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="BsmtUnfSF", data=train, palette = mycols);
  • This feature has a significant positive correlation with SalePrice, with a small proportion of data points having a value of 0. This tells me that most houses will have some amount of square feet unfinished within the basement, and this actually positively contributes towards SalePrice.
  • The amount of unfinished square feet also varies widely based on location and style.
  • Whereas the average unfinished square feet within the basement is fairly consistent across the different lot shapes.
  • Since this is a continuous numeric feature with a significant correlation, we will bin this and create dummy variables.
In [28]:
all_data['BsmtUnfSF_Band'] = pd.cut(all_data['BsmtUnfSF'], 3)
all_data['BsmtUnfSF_Band'].unique()
Out[28]:
[(-2.336, 778.667], (778.667, 1557.333], (1557.333, 2336.0]]
Categories (3, interval[float64]): [(-2.336, 778.667] < (778.667, 1557.333] < (1557.333, 2336.0]]
In [29]:
all_data.loc[all_data['BsmtUnfSF']<=778.667, 'BsmtUnfSF'] = 1
all_data.loc[(all_data['BsmtUnfSF']>778.667) & (all_data['BsmtUnfSF']<=1557.333), 'BsmtUnfSF'] = 2
all_data.loc[all_data['BsmtUnfSF']>1557.333, 'BsmtUnfSF'] = 3
all_data['BsmtUnfSF'] = all_data['BsmtUnfSF'].astype(int)

all_data.drop('BsmtUnfSF_Band', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["BsmtUnfSF"], prefix="BsmtUnfSF")
all_data.head(3)
Out[29]:
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFullBath BsmtHalfBath ... BsmtFinType2_BLQ BsmtFinType2_GLQ BsmtFinType2_LwQ BsmtFinType2_None BsmtFinType2_Rec BsmtFinType2_Unf BsmtFinSf2_Flag BsmtUnfSF_1 BsmtUnfSF_2 BsmtUnfSF_3
0 856 854 0 None 3 1Fam 3 1 1.0 0.0 ... 0 0 0 0 0 1 0 1 0 0
1 1262 0 0 None 3 1Fam 3 4 0.0 1.0 ... 0 0 0 0 0 1 0 1 0 0
2 920 866 0 None 3 1Fam 3 2 1.0 0.0 ... 0 0 0 0 0 1 0 1 0 0

3 rows × 140 columns

TotalBsmtSF

  • Total square feet of basement area.
In [30]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['TotalBsmtSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['TotalBsmtSF'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="TotalBsmtSF", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="TotalBsmtSF", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="TotalBsmtSF", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="TotalBsmtSF", data=train, palette = mycols);
  • This will be a very important feature within the analysis, due to such a high correlation with Saleprice.
  • We can see that it varies widely based on location, however the average basement size has a lower variance based on type, style and lot shape.
  • Due to this being a continuous numeric feature and also being a very significant feature when describing SalePrice, we believe there could be more value to be mined within this feature. Hence, we will create some binnings and dummy variables.
In [31]:
all_data['TotalBsmtSF_Band'] = pd.cut(all_data['TotalBsmtSF'], 10)
all_data['TotalBsmtSF_Band'].unique()
Out[31]:
[(509.5, 1019.0], (1019.0, 1528.5], (1528.5, 2038.0], (-5.095, 509.5], (2038.0, 2547.5], (3057.0, 3566.5], (2547.5, 3057.0], (4585.5, 5095.0]]
Categories (8, interval[float64]): [(-5.095, 509.5] < (509.5, 1019.0] < (1019.0, 1528.5] < (1528.5, 2038.0] < (2038.0, 2547.5] < (2547.5, 3057.0] < (3057.0, 3566.5] < (4585.5, 5095.0]]
In [32]:
all_data.loc[all_data['TotalBsmtSF']<=509.5, 'TotalBsmtSF'] = 1
all_data.loc[(all_data['TotalBsmtSF']>509.5) & (all_data['TotalBsmtSF']<=1019), 'TotalBsmtSF'] = 2
all_data.loc[(all_data['TotalBsmtSF']>1019) & (all_data['TotalBsmtSF']<=1528.5), 'TotalBsmtSF'] = 3
all_data.loc[(all_data['TotalBsmtSF']>1528.5) & (all_data['TotalBsmtSF']<=2038), 'TotalBsmtSF'] = 4
all_data.loc[(all_data['TotalBsmtSF']>2038) & (all_data['TotalBsmtSF']<=2547.5), 'TotalBsmtSF'] = 5
all_data.loc[(all_data['TotalBsmtSF']>2547.5) & (all_data['TotalBsmtSF']<=3057), 'TotalBsmtSF'] = 6
all_data.loc[(all_data['TotalBsmtSF']>3057) & (all_data['TotalBsmtSF']<=3566.5), 'TotalBsmtSF'] = 7
all_data.loc[all_data['TotalBsmtSF']>3566.5, 'TotalBsmtSF'] = 8
all_data['TotalBsmtSF'] = all_data['TotalBsmtSF'].astype(int)

all_data.drop('TotalBsmtSF_Band', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["TotalBsmtSF"], prefix="TotalBsmtSF")
all_data.head(3)
Out[32]:
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFullBath BsmtHalfBath ... BsmtUnfSF_2 BsmtUnfSF_3 TotalBsmtSF_1 TotalBsmtSF_2 TotalBsmtSF_3 TotalBsmtSF_4 TotalBsmtSF_5 TotalBsmtSF_6 TotalBsmtSF_7 TotalBsmtSF_8
0 856 854 0 None 3 1Fam 3 1 1.0 0.0 ... 0 0 0 1 0 0 0 0 0 0
1 1262 0 0 None 3 1Fam 3 4 0.0 1.0 ... 0 0 0 0 1 0 0 0 0 0
2 920 866 0 None 3 1Fam 3 2 1.0 0.0 ... 0 0 0 1 0 0 0 0 0 0

3 rows × 147 columns

1stFlrSF

  • First floor square feet.
In [33]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['1stFlrSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['1stFlrSF'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="1stFlrSF", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="1stFlrSF", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="1stFlrSF", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="1stFlrSF", data=train, palette = mycols);
  • Clearly this shows a very high positive correlation with SalePrice, this will be an important feature during modeling.
  • Once again, this feature varies greatly across neighborhoods and the size of this feature varies across building types and styles.
  • This feature does not vary so much across the lot size.
  • Since this is a continuous numeric feature, we will bin this feature and create dummy variables.
In [34]:
all_data['1stFlrSF_Band'] = pd.cut(all_data['1stFlrSF'], 6)
all_data['1stFlrSF_Band'].unique()
Out[34]:
[(329.239, 1127.5], (1127.5, 1921.0], (1921.0, 2714.5], (2714.5, 3508.0], (3508.0, 4301.5], (4301.5, 5095.0]]
Categories (6, interval[float64]): [(329.239, 1127.5] < (1127.5, 1921.0] < (1921.0, 2714.5] < (2714.5, 3508.0] < (3508.0, 4301.5] < (4301.5, 5095.0]]
In [35]:
all_data.loc[all_data['1stFlrSF']<=1127.5, '1stFlrSF'] = 1
all_data.loc[(all_data['1stFlrSF']>1127.5) & (all_data['1stFlrSF']<=1921), '1stFlrSF'] = 2
all_data.loc[(all_data['1stFlrSF']>1921) & (all_data['1stFlrSF']<=2714.5), '1stFlrSF'] = 3
all_data.loc[(all_data['1stFlrSF']>2714.5) & (all_data['1stFlrSF']<=3508), '1stFlrSF'] = 4
all_data.loc[(all_data['1stFlrSF']>3508) & (all_data['1stFlrSF']<=4301.5), '1stFlrSF'] = 5
all_data.loc[all_data['1stFlrSF']>4301.5, '1stFlrSF'] = 6
all_data['1stFlrSF'] = all_data['1stFlrSF'].astype(int)

all_data.drop('1stFlrSF_Band', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["1stFlrSF"], prefix="1stFlrSF")
all_data.head(3)
Out[35]:
2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFullBath BsmtHalfBath BsmtQual ... TotalBsmtSF_5 TotalBsmtSF_6 TotalBsmtSF_7 TotalBsmtSF_8 1stFlrSF_1 1stFlrSF_2 1stFlrSF_3 1stFlrSF_4 1stFlrSF_5 1stFlrSF_6
0 854 0 None 3 1Fam 3 1 1.0 0.0 3 ... 0 0 0 0 1 0 0 0 0 0
1 0 0 None 3 1Fam 3 4 0.0 1.0 3 ... 0 0 0 0 0 1 0 0 0 0
2 866 0 None 3 1Fam 3 2 1.0 0.0 3 ... 0 0 0 0 1 0 0 0 0 0

3 rows × 152 columns

2ndFlrSF

  • Second floor square feet.
In [36]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['2ndFlrSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['2ndFlrSF'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="2ndFlrSF", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="2ndFlrSF", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="2ndFlrSF", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="2ndFlrSF", data=train, palette = mycols);
  • Interestingly we see a highly positively correlated relationship with SalePrice, however we also see a significant number of houses with value = 0.
  • This is explained with the other visuals, showing that some styles of houses perhaps do not have a second floor, hence cannot have a value for this feature - such as "1Story" houses.
  • We also see a high dependance and variation between neighborhoods, building types and lot sizes.
  • It is evident that all the variables related to "space" are important in this analysis. Since this feature is a continuous numeric feature, we will bin this and create dummy variables.
In [37]:
all_data['2ndFlrSF_Band'] = pd.cut(all_data['2ndFlrSF'], 6)
all_data['2ndFlrSF_Band'].unique()
Out[37]:
[(620.667, 931.0], (-1.862, 310.333], (931.0, 1241.333], (310.333, 620.667], (1241.333, 1551.667], (1551.667, 1862.0]]
Categories (6, interval[float64]): [(-1.862, 310.333] < (310.333, 620.667] < (620.667, 931.0] < (931.0, 1241.333] < (1241.333, 1551.667] < (1551.667, 1862.0]]
In [38]:
all_data.loc[all_data['2ndFlrSF']<=310.333, '2ndFlrSF'] = 1
all_data.loc[(all_data['2ndFlrSF']>310.333) & (all_data['2ndFlrSF']<=620.667), '2ndFlrSF'] = 2
all_data.loc[(all_data['2ndFlrSF']>620.667) & (all_data['2ndFlrSF']<=931), '2ndFlrSF'] = 3
all_data.loc[(all_data['2ndFlrSF']>931) & (all_data['2ndFlrSF']<=1241.333), '2ndFlrSF'] = 4
all_data.loc[(all_data['2ndFlrSF']>1241.333) & (all_data['2ndFlrSF']<=1551.667), '2ndFlrSF'] = 5
all_data.loc[all_data['2ndFlrSF']>1551.667, '2ndFlrSF'] = 6
all_data['2ndFlrSF'] = all_data['2ndFlrSF'].astype(int)

all_data.drop('2ndFlrSF_Band', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["2ndFlrSF"], prefix="2ndFlrSF")
all_data.head(3)
Out[38]:
3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFullBath BsmtHalfBath BsmtQual CentralAir ... 1stFlrSF_3 1stFlrSF_4 1stFlrSF_5 1stFlrSF_6 2ndFlrSF_1 2ndFlrSF_2 2ndFlrSF_3 2ndFlrSF_4 2ndFlrSF_5 2ndFlrSF_6
0 0 None 3 1Fam 3 1 1.0 0.0 3 Y ... 0 0 0 0 0 0 1 0 0 0
1 0 None 3 1Fam 3 4 0.0 1.0 3 Y ... 0 0 0 0 1 0 0 0 0 0
2 0 None 3 1Fam 3 2 1.0 0.0 3 Y ... 0 0 0 0 0 0 1 0 0 0

3 rows × 157 columns

LowQualFinSF

  • Low quality finished square feet (all floors)
In [39]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['LowQualFinSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['LowQualFinSF'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="LowQualFinSF", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="LowQualFinSF", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="LowQualFinSF", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="LowQualFinSF", data=train, palette = mycols);
  • We can see that there is a large number of properties with a value of 0 for this feature. Clearly, it does not have a significant correlation with SalePrice.
  • For this reason, we will replace this feature with a flag.

BsmtHalfBath, BsmtFullBath, HalfBath, FullBath

  • Number of bathrooms. For this feature, it made sense to sum them all together and create a total bathrooms feature.
In [40]:
all_data['TotalBathrooms'] = all_data['BsmtHalfBath'] + all_data['BsmtFullBath'] + all_data['HalfBath'] + all_data['FullBath']

columns = ['BsmtHalfBath', 'BsmtFullBath', 'HalfBath', 'FullBath']
all_data.drop(columns, axis=1, inplace=True)

Bedroom

  • Bedrooms above grade (does not include basement bedrooms)
In [41]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="BedroomAbvGr", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="BedroomAbvGr", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="BedroomAbvGr", y="SalePrice", data=train, palette = mycols);
  • We see a lot of houses with 2 3 and 4 bedrooms above ground, and a very low number of houses with 6 or above.
  • Since this is a continuous numeric feature, we will leave it how it is.

Kitchen

  • Kitchens above grade.
In [42]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="KitchenAbvGr", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="KitchenAbvGr", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="KitchenAbvGr", y="SalePrice", data=train, palette = mycols);
  • Similarly to last previous feature, we see just a small number of houses with a large number of kitchens above grade. This shows that most houses have 1 kitchen above grade.
  • Since this is a continuous numeric feature, we will leave it as it is.

KitchenQual

  • Kitchen quality.
In [43]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="KitchenQual", y="SalePrice", data=train, order=["Fa", "TA", "Gd", "Ex"], palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="KitchenQual", y="SalePrice", data=train, size = 5, jitter = True, order=["Fa", "TA", "Gd", "Ex"], palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="KitchenQual", y="SalePrice", data=train, order=["Fa", "TA", "Gd", "Ex"], palette = mycols);
  • There is a clear positive correlation with the SalePrice and the quality of the kitchen.
  • There is one value for "Gd" that has an extremely high SalePrice however.
  • For this feature, since it is categorical with an order, we will replace these values manually.
In [44]:
all_data['KitchenQual'] = all_data['KitchenQual'].map({"Fa":1, "TA":2, "Gd":3, "Ex":4})
all_data['KitchenQual'].unique()
Out[44]:
array([3, 2, 4, 1], dtype=int64)

TotRmsAbvGrd

  • Total rooms above grade (does not include bathrooms)
In [45]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="TotRmsAbvGrd", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="TotRmsAbvGrd", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="TotRmsAbvGrd", y="SalePrice", data=train, palette = mycols);
  • Generally we see a positive correlation, as the number of rooms increases, so does the SalePrice.
  • However due to low frequency, we do see some unreliable results for the very large and small values for this feature.
  • Since this is a continuous numeric feature, we will leave it as it is.

Fireplaces

  • Number of fireplaces
In [46]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="Fireplaces", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="Fireplaces", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="Fireplaces", y="SalePrice", data=train, palette = mycols);
  • We have a positive correlation with SalePrice, with most houses having just 1 or 0 fireplaces.
  • We will leave this feature as it is.

FireplaceQu

  • Fireplace quality.
In [47]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="FireplaceQu", y="SalePrice", data=train, order=["Po", "Fa", "TA", "Gd", "Ex"], palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="FireplaceQu", y="SalePrice", data=train, size = 5, jitter = True, order=["Po", "Fa", "TA", "Gd", "Ex"], palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="FireplaceQu", y="SalePrice", data=train, order=["Po", "Fa", "TA", "Gd", "Ex"], palette = mycols);
  • We also see a positive correlation and the fireplace quality increases. Most houses have either "TA" or "Gd" quality fireplaces.
  • Since this is a categorical feature with order, we will replace the values manually.
In [48]:
all_data['FireplaceQu'] = all_data['FireplaceQu'].map({"None":0, "Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5})
all_data['FireplaceQu'].unique()
Out[48]:
array([0, 3, 4, 2, 5, 1], dtype=int64)

GrLivArea

  • Above grade ground living area in square feet.
In [49]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['GrLivArea'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['GrLivArea'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="GrLivArea", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="GrLivArea", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="GrLivArea", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="GrLivArea", data=train, palette = mycols);
  • We see a very high positive correlation with SalePrice.
  • We also see the values varying very highly between styles of houses and neigborhood.
  • Since this will be an important feature in our modeling, we will create bins and dummy features.
In [50]:
all_data['GrLivArea_Band'] = pd.cut(all_data['GrLivArea'], 6)
all_data['GrLivArea_Band'].unique()
Out[50]:
[(1127.5, 1921.0], (1921.0, 2714.5], (329.239, 1127.5], (2714.5, 3508.0], (3508.0, 4301.5], (4301.5, 5095.0]]
Categories (6, interval[float64]): [(329.239, 1127.5] < (1127.5, 1921.0] < (1921.0, 2714.5] < (2714.5, 3508.0] < (3508.0, 4301.5] < (4301.5, 5095.0]]
In [51]:
all_data.loc[all_data['GrLivArea']<=1127.5, 'GrLivArea'] = 1
all_data.loc[(all_data['GrLivArea']>1127.5) & (all_data['GrLivArea']<=1921), 'GrLivArea'] = 2
all_data.loc[(all_data['GrLivArea']>1921) & (all_data['GrLivArea']<=2714.5), 'GrLivArea'] = 3
all_data.loc[(all_data['GrLivArea']>2714.5) & (all_data['GrLivArea']<=3508), 'GrLivArea'] = 4
all_data.loc[(all_data['GrLivArea']>3508) & (all_data['GrLivArea']<=4301.5), 'GrLivArea'] = 5
all_data.loc[all_data['GrLivArea']>4301.5, 'GrLivArea'] = 6
all_data['GrLivArea'] = all_data['GrLivArea'].astype(int)

all_data.drop('GrLivArea_Band', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["GrLivArea"], prefix="GrLivArea")
all_data.head(3)
Out[51]:
3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 ... 2ndFlrSF_4 2ndFlrSF_5 2ndFlrSF_6 TotalBathrooms GrLivArea_1 GrLivArea_2 GrLivArea_3 GrLivArea_4 GrLivArea_5 GrLivArea_6
0 0 None 3 1Fam 3 1 3 Y Norm Norm ... 0 0 0 4.0 0 1 0 0 0 0
1 0 None 3 1Fam 3 4 3 Y Feedr Norm ... 0 0 0 3.0 0 1 0 0 0 0
2 0 None 3 1Fam 3 2 3 Y Norm Norm ... 0 0 0 4.0 0 1 0 0 0 0

3 rows × 159 columns

4.2.3 Architectural & Structural

MSSubClass

  • Identifies the type of dwelling involved in the sale.
In [52]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="MSSubClass", y="SalePrice", data=train, palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="MSSubClass", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="MSSubClass", y="SalePrice", data=train, palette = mycols);
  • Each of these classes represents a very different style of building, as shown in the data description. Hence, we can see large variance between classes with SalePrice.
  • This is a numeric feature, but it should actually be categorical. We could cluster some of these categories together, but for now we will create a dummy feature for each category.
In [53]:
all_data['MSSubClass'] = all_data['MSSubClass'].astype(str)

all_data = pd.get_dummies(all_data, columns = ["MSSubClass"], prefix="MSSubClass")
all_data.head(3)
Out[53]:
3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 ... MSSubClass_30 MSSubClass_40 MSSubClass_45 MSSubClass_50 MSSubClass_60 MSSubClass_70 MSSubClass_75 MSSubClass_80 MSSubClass_85 MSSubClass_90
0 0 None 3 1Fam 3 1 3 Y Norm Norm ... 0 0 0 0 1 0 0 0 0 0
1 0 None 3 1Fam 3 4 3 Y Feedr Norm ... 0 0 0 0 0 0 0 0 0 0
2 0 None 3 1Fam 3 2 3 Y Norm Norm ... 0 0 0 0 1 0 0 0 0 0

3 rows × 174 columns

BldgType

  • Type of dwelling.
In [54]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="BldgType", y="SalePrice", data=train, palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="BldgType", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="BldgType", y="SalePrice", data=train, palette = mycols);
  • The different categories exhibit a range of average SalePrice's. The class with the most observations is "1Fam".
  • We can also see that the variance within classes is quite tight, with only a few extreme values in each case.
  • There could be a possibility to cluster these classes, however for now we are going to create dummy features.
In [55]:
all_data['BldgType'] = all_data['BldgType'].astype(str)

all_data = pd.get_dummies(all_data, columns = ["BldgType"], prefix="BldgType")
all_data.head(3)
Out[55]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... MSSubClass_70 MSSubClass_75 MSSubClass_80 MSSubClass_85 MSSubClass_90 BldgType_1Fam BldgType_2fmCon BldgType_Duplex BldgType_Twnhs BldgType_TwnhsE
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 0 1 0 0 0 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 1 0 0 0 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 0 0 1 0 0 0 0

3 rows × 178 columns

HouseStyle

  • Style of dwelling.
In [56]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="HouseStyle", y="SalePrice", data=train, palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="HouseStyle", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="HouseStyle", y="SalePrice", data=train, palette = mycols);
  • Here we see quite a few extreme values across the categories and a large weighting of observations towards the integer story houses.
  • Although the highest average SalePrice comes from "2.5Fin", this has a very high standard deviation and therefore more reliably, the "2Story" houses are also very highly priced on average.
  • Since there are some categories with very few values, we will cluster these into another category and create dummy variables.
In [57]:
all_data['HouseStyle'] = all_data['HouseStyle'].map({"2Story":"2Story", "1Story":"1Story", "1.5Fin":"1.5Story", "1.5Unf":"1.5Story", 
                                                     "SFoyer":"SFoyer", "SLvl":"SLvl", "2.5Unf":"2.5Story", "2.5Fin":"2.5Story"})

all_data = pd.get_dummies(all_data, columns = ["HouseStyle"], prefix="HouseStyle")
all_data.head(3)
Out[57]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... BldgType_2fmCon BldgType_Duplex BldgType_Twnhs BldgType_TwnhsE HouseStyle_1.5Story HouseStyle_1Story HouseStyle_2.5Story HouseStyle_2Story HouseStyle_SFoyer HouseStyle_SLvl
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 1 0 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 1 0 0 0 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 1 0 0

3 rows × 183 columns

OverallQual

  • Rates the overall material and finish of the house.
In [58]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="OverallQual", y="SalePrice", data=train, palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="OverallQual", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="OverallQual", y="SalePrice", data=train, palette = mycols);
  • This feature although being numeric is actually categoric and ordinal, as the value increases so does the SalePrice. Hence, we will keep it as a numeric feature.
  • We see here a nice positive correlation with the increase in OverallQual and the SalePrice, as you'd expect.

OverallCond

  • Rates the overall condition of the house.
In [59]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="OverallCond", y="SalePrice", data=train, palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="OverallCond", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="OverallCond", y="SalePrice", data=train, palette = mycols);
  • Interestingly, we see here that it does follow a positive correlation with SalePrice, however we see a peak at a value of 5, along with a high number of observations at this value.
  • The highest average SalePrice actually comes from a value of 5 as opposed to 10, which may be a reasonable assumption.
  • For this feature, we will leave it as being numeric and ordinal.

YearRemodAdd

  • Remodel date (same as construction date if no remodeling or additions).
In [60]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="YearRemodAdd", y="SalePrice", data=train, palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="YearRemodAdd", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="YearRemodAdd", y="SalePrice", data=train, palette = mycols);
  • Here we can see that the newer the remodelling of a house, the higher the SalePrice.
  • From the data description, we believe that creating a new feature describing the difference in number of years between remodeling and construction may be a good choice.
In [61]:
train['Remod_Diff'] = train['YearRemodAdd'] - train['YearBuilt']

plt.subplots(figsize =(40, 10))
sns.barplot(x="Remod_Diff", y="SalePrice", data=train, palette = mycols);
  • Clearly we can see that there are some values which have a much higher SalePrice than others. We will leave this feature as it is, without any binnings.
In [62]:
all_data['Remod_Diff'] = all_data['YearRemodAdd'] - all_data['YearBuilt']

all_data.drop('YearRemodAdd', axis=1, inplace=True)

YearBuilt

  • Original construction date.
In [63]:
plt.subplots(figsize =(50, 10))

sns.barplot(x="YearBuilt", y="SalePrice", data=train, palette = mycols);
  • Here we can see a fairly consistent upward trend for the SalePrice as houses are more modern.
  • For this feature, we are going to create bins and dummy features
In [64]:
all_data['YearBuilt_Band'] = pd.cut(all_data['YearBuilt'], 7)
all_data['YearBuilt_Band'].unique()
Out[64]:
[(1990.286, 2010.0], (1970.571, 1990.286], (1911.429, 1931.143], (1931.143, 1950.857], (1950.857, 1970.571], (1891.714, 1911.429], (1871.862, 1891.714]]
Categories (7, interval[float64]): [(1871.862, 1891.714] < (1891.714, 1911.429] < (1911.429, 1931.143] < (1931.143, 1950.857] < (1950.857, 1970.571] < (1970.571, 1990.286] < (1990.286, 2010.0]]
In [65]:
all_data.loc[all_data['YearBuilt']<=1892, 'YearBuilt'] = 1
all_data.loc[(all_data['YearBuilt']>1892) & (all_data['YearBuilt']<=1911), 'YearBuilt'] = 2
all_data.loc[(all_data['YearBuilt']>1911) & (all_data['YearBuilt']<=1931), 'YearBuilt'] = 3
all_data.loc[(all_data['YearBuilt']>1931) & (all_data['YearBuilt']<=1951), 'YearBuilt'] = 4
all_data.loc[(all_data['YearBuilt']>1951) & (all_data['YearBuilt']<=1971), 'YearBuilt'] = 5
all_data.loc[(all_data['YearBuilt']>1971) & (all_data['YearBuilt']<=1990), 'YearBuilt'] = 6
all_data.loc[all_data['YearBuilt']>1990, 'YearBuilt'] = 7
all_data['YearBuilt'] = all_data['YearBuilt'].astype(int)

all_data.drop('YearBuilt_Band', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["YearBuilt"], prefix="YearBuilt")
all_data.head(3)
Out[65]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... HouseStyle_SFoyer HouseStyle_SLvl Remod_Diff YearBuilt_1 YearBuilt_2 YearBuilt_3 YearBuilt_4 YearBuilt_5 YearBuilt_6 YearBuilt_7
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 0 0 1
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 0 0 0 1 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 1 0 0 0 0 0 0 1

3 rows × 189 columns

Foundation

  • Type of foundation.
In [66]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="Foundation", y="SalePrice", data=train, palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="Foundation", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="Foundation", y="SalePrice", data=train, palette = mycols);
  • We have 3 classes with high frequency, however we have 3 of low frequency.
  • Due to the large difference in median and mean SalePrice's across the 3 lower frequent classes, we not going to cluster these together.
  • Also since this feature is not ordinal, labelling does not make sense. Instead we will create dummy variables.
In [67]:
all_data = pd.get_dummies(all_data, columns = ["Foundation"], prefix="Foundation")
all_data.head(3)
Out[67]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... YearBuilt_4 YearBuilt_5 YearBuilt_6 YearBuilt_7 Foundation_BrkTil Foundation_CBlock Foundation_PConc Foundation_Slab Foundation_Stone Foundation_Wood
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 1 0 0 1 0 0 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 1 0 0 1 0 0 0 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 1 0 0 1 0 0 0

3 rows × 194 columns

Functional

  • Home functionality.
In [68]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="Functional", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="Functional", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="Functional", y="SalePrice", data=train, palette = mycols);
  • This categorical feature shows that most houses have "Typ" functionality, and looking at the data description leads us to believe that there is an order within these categories, "Typ" being of the highest order.
  • Therefore, we will replace the values of this feature manually with numbers.
In [69]:
all_data['Functional'] = all_data['Functional'].map({"Sev":1, "Maj2":2, "Maj1":3, "Mod":4, "Min2":5, "Min1":6, "Typ":7})
all_data['Functional'].unique()
Out[69]:
array([7, 6, 3, 5, 4, 2, 1], dtype=int64)

4.2.4 Exterior

RoofStyle

  • Type of roof.
In [70]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="RoofStyle", y="SalePrice", data=train, palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="RoofStyle", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="RoofStyle", y="SalePrice", data=train, palette = mycols);
  • This feature has two highly frequent categories but the values of SalePrice differ between each.
  • Since this is a categorical feature without order, we will create dummy variables.
In [71]:
all_data = pd.get_dummies(all_data, columns = ["RoofStyle"], prefix="RoofStyle")
all_data.head(3)
Out[71]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... Foundation_PConc Foundation_Slab Foundation_Stone Foundation_Wood RoofStyle_Flat RoofStyle_Gable RoofStyle_Gambrel RoofStyle_Hip RoofStyle_Mansard RoofStyle_Shed
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 1 0 0 0 0 1 0 0 0 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 1 0 0 0 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 1 0 0 0 0 1 0 0 0 0

3 rows × 199 columns

RoofMatl

  • Roof material.
In [72]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="RoofMatl", y="SalePrice", data=train, palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="RoofMatl", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="RoofMatl", y="SalePrice", data=train, palette = mycols);
  • Interestingly, there are very few observations in the training data for several classes. However, these will be dropped during feature reduction if they turn out to be insignificant.
  • Hence, we will create dummy variables.
In [73]:
all_data = pd.get_dummies(all_data, columns = ["RoofMatl"], prefix="RoofMatl")
all_data.head(3)
Out[73]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... RoofStyle_Hip RoofStyle_Mansard RoofStyle_Shed RoofMatl_CompShg RoofMatl_Membran RoofMatl_Metal RoofMatl_Roll RoofMatl_Tar&Grv RoofMatl_WdShake RoofMatl_WdShngl
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 1 0 0 0 0 0 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 1 0 0 0 0 0 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 1 0 0 0 0 0 0

3 rows × 205 columns

Exterior1st & Exterior2nd

  • Exterior covering on house.
In [74]:
plt.subplots(figsize =(35, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="Exterior1st", y="SalePrice", data=train, palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="Exterior1st", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="Exterior1st", y="SalePrice", data=train, palette = mycols);

plt.subplots(figsize =(35, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="Exterior2nd", y="SalePrice", data=train, palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="Exterior2nd", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="Exterior2nd", y="SalePrice", data=train, palette = mycols);
  • Looking at these 2 features together, we can see that they exhibit very similar behaviours against SalePrice. This tells us that they are very closely related.
  • Hence, we will create a flag to indicate whether there is a different 2nd exterior covering to the first.
  • Then we will keep "Exterior1st" and create dummy variables from this.
In [75]:
def Exter2(col):
    if col['Exterior2nd'] == col['Exterior1st']:
        return 1
    else:
        return 0
    
all_data['ExteriorMatch_Flag'] = all_data.apply(Exter2, axis=1)
all_data.drop('Exterior2nd', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["Exterior1st"], prefix="Exterior1st")
all_data.head(3)
Out[75]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... Exterior1st_CemntBd Exterior1st_HdBoard Exterior1st_ImStucc Exterior1st_MetalSd Exterior1st_Plywood Exterior1st_Stone Exterior1st_Stucco Exterior1st_VinylSd Exterior1st_Wd Sdng Exterior1st_WdShing
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 1 0 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 1 0 0 0 0 0 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 1 0 0

3 rows × 219 columns

MasVnrType

  • Masonry veneer type.
In [76]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="MasVnrType", y="SalePrice", data=train, palette = mycols);

plt.subplot(1, 3, 2)
sns.stripplot(x="MasVnrType", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="MasVnrType", y="SalePrice", data=train, palette = mycols);
  • Each class has quite a unique range of values for SalePrice, the only class that stands out is "BrkCmn", which has a low frequency.
  • Clearly "Stone" demands the highest SalePrice on average, although there are some extreme values within "BrkFace".
  • Since this is a categorical feature without order, we will create dummy variables here.
In [77]:
all_data = pd.get_dummies(all_data, columns = ["MasVnrType"], prefix="MasVnrType")
all_data.head(3)
Out[77]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... Exterior1st_Plywood Exterior1st_Stone Exterior1st_Stucco Exterior1st_VinylSd Exterior1st_Wd Sdng Exterior1st_WdShing MasVnrType_BrkCmn MasVnrType_BrkFace MasVnrType_None MasVnrType_Stone
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 1 0 0 0 1 0 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 0 0 0 1 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 1 0 0 0 1 0 0

3 rows × 222 columns

MasVnrArea

  • Masonry veneer area in square feet.
In [78]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['MasVnrArea'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['MasVnrArea'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="MasVnrArea", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="MasVnrArea", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="MasVnrArea", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="MasVnrArea", data=train, palette = mycols);
C:\Users\Derrick\AppData\Roaming\Python\Python36\site-packages\scipy\stats\stats.py:5240: RuntimeWarning: invalid value encountered in less
  x = np.where(x < 1.0, x, 1.0)  # if x > 1 then return 1.0
  • From this we can see that this feature has negligible correlation with SalePrice, and the values for this feature vary widely based on house type, style and size.
  • Since this feature is insignificant in regards to SalePrice, and it also correlates highly with "MasVnrType" (if "MasVnrType = "None" then it has to be equal to 0), we will drop this feature.
In [79]:
all_data.drop('MasVnrArea', axis=1, inplace=True)

ExterQual

  • Evaluates the quality of the material on the exterior.
In [80]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="ExterQual", y="SalePrice", data=train, order=['Fa','TA','Gd', 'Ex'], palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="ExterQual", y="SalePrice", data=train, size = 5, jitter = True, order=['Fa','TA','Gd', 'Ex'], palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="ExterQual", y="SalePrice", data=train, order=['Fa','TA','Gd', 'Ex'], palette = mycols);
  • We can see here that this feature shows a clear order and has a positive correlation with SalePrice. As the quality increases, so does the SalePrice.
  • We see the largest number of observations within the two middle classes, and the lowest observations within the lowest class.
  • Since this is a categorical feature with order, we will replace these values manually.
In [81]:
all_data['ExterQual'] = all_data['ExterQual'].map({"Fa":1, "TA":2, "Gd":3, "Ex":4})
all_data['ExterQual'].unique()
Out[81]:
array([3, 2, 4, 1], dtype=int64)

ExterCond

  • Evaluates the present condition of the material on the exterior.
In [82]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="ExterCond", y="SalePrice", data=train, order=['Po','Fa','TA','Gd', 'Ex'], palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="ExterCond", y="SalePrice", data=train, size = 5, jitter = True, order=['Po','Fa','TA','Gd', 'Ex'], palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="ExterCond", y="SalePrice", data=train, order=['Po','Fa','TA','Gd', 'Ex'], palette = mycols);
  • Interestingly we see the largest values of SalePrice for the second and third best classes. This is perhaps because of the large frequency of values within these classes, whereas we only see 3 observations within "Ex" from the training data.
  • Since this categorical feature has an order, but the SalePrice does not necessarily correlate with this order... we will create dummy variables.
In [83]:
all_data = pd.get_dummies(all_data, columns = ["ExterCond"], prefix="ExterCond")
all_data.head(3)
Out[83]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... Exterior1st_WdShing MasVnrType_BrkCmn MasVnrType_BrkFace MasVnrType_None MasVnrType_Stone ExterCond_Ex ExterCond_Fa ExterCond_Gd ExterCond_Po ExterCond_TA
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 1 0 0 0 0 0 0 1
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 1 0 0 0 0 0 1
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 1 0 0 0 0 0 0 1

3 rows × 225 columns

GarageType

  • Garage location.
In [84]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="GarageType", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="GarageType", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="GarageType", y="SalePrice", data=train, palette = mycols);
  • Here we see "BuiltIn" and "Attched" having the 2 highest average SalePrices, with only a few extreme values within each class.
  • Since this is categorical without order, we will create dummy variables.
In [85]:
all_data = pd.get_dummies(all_data, columns = ["GarageType"], prefix="GarageType")
all_data.head(3)
Out[85]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... ExterCond_Gd ExterCond_Po ExterCond_TA GarageType_2Types GarageType_Attchd GarageType_Basment GarageType_BuiltIn GarageType_CarPort GarageType_Detchd GarageType_None
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 1 0 1 0 0 0 0 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 1 0 1 0 0 0 0 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 1 0 1 0 0 0 0 0

3 rows × 231 columns

GarageYrBlt

  • Year garage was built.
In [86]:
plt.subplots(figsize =(50, 10))

sns.boxplot(x="GarageYrBlt", y="SalePrice", data=train, palette = mycols);
  • We can see a slight upward trend as the garage building year becomes more modern.
  • For this feature we are going to create bins and the dummy variables.
In [87]:
all_data['GarageYrBlt_Band'] = pd.qcut(all_data['GarageYrBlt'], 3)
all_data['GarageYrBlt_Band'].unique()
Out[87]:
[(1996.0, 2207.0], (1964.0, 1996.0], (-0.001, 1964.0]]
Categories (3, interval[float64]): [(-0.001, 1964.0] < (1964.0, 1996.0] < (1996.0, 2207.0]]
In [88]:
all_data.loc[all_data['GarageYrBlt']<=1964, 'GarageYrBlt'] = 1
all_data.loc[(all_data['GarageYrBlt']>1964) & (all_data['GarageYrBlt']<=1996), 'GarageYrBlt'] = 2
all_data.loc[all_data['GarageYrBlt']>1996, 'GarageYrBlt'] = 3
all_data['GarageYrBlt'] = all_data['GarageYrBlt'].astype(int)

all_data.drop('GarageYrBlt_Band', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["GarageYrBlt"], prefix="GarageYrBlt")
all_data.head(3)
Out[88]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... GarageType_2Types GarageType_Attchd GarageType_Basment GarageType_BuiltIn GarageType_CarPort GarageType_Detchd GarageType_None GarageYrBlt_1 GarageYrBlt_2 GarageYrBlt_3
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 1 0 0 0 0 0 0 0 1
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 1 0 0 0 0 0 0 1 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 1 0 0 0 0 0 0 0 1

3 rows × 233 columns

GarageFinish

  • Interior finish of the garage.
In [89]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="GarageFinish", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="GarageFinish", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="GarageFinish", y="SalePrice", data=train, palette = mycols);
  • Here we see a nice split between the 3 classes, with "Fin" producing the highest SalePrice's on average.
  • We will create dummy variables for this feature.
In [90]:
all_data = pd.get_dummies(all_data, columns = ["GarageFinish"], prefix="GarageFinish")
all_data.head(3)
Out[90]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... GarageType_CarPort GarageType_Detchd GarageType_None GarageYrBlt_1 GarageYrBlt_2 GarageYrBlt_3 GarageFinish_Fin GarageFinish_None GarageFinish_RFn GarageFinish_Unf
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 0 1 0 0 1 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 1 0 0 0 1 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 0 0 1 0 0 1 0

3 rows × 236 columns

GarageCars

  • Size of the garage in car capacity.
In [91]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="GarageCars", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="GarageCars", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="GarageCars", y="SalePrice", data=train, palette = mycols);
  • We generally see a positive correlation with an increasing garage car capacity. However, we see a slight dip for 4 cars we believe due to the low frequency of houses with a 4 car garage.

GarageArea

  • Size of the garage in square feet.
In [92]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['GarageArea'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['GarageArea'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="GarageArea", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="GarageArea", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="GarageArea", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="GarageArea", data=train, palette = mycols);
  • This has an extremely high positive correlation with SalePrice, and it is highly dependant on Neighborhood, building type and style of the house.
  • This could be an important feature in the analysis, so we will bin this feature and create dummy variables.
In [93]:
all_data['GarageArea_Band'] = pd.cut(all_data['GarageArea'], 3)
all_data['GarageArea_Band'].unique()
Out[93]:
[(496.0, 992.0], (-1.488, 496.0], (992.0, 1488.0]]
Categories (3, interval[float64]): [(-1.488, 496.0] < (496.0, 992.0] < (992.0, 1488.0]]
In [94]:
all_data.loc[all_data['GarageArea']<=496, 'GarageArea'] = 1
all_data.loc[(all_data['GarageArea']>496) & (all_data['GarageArea']<=992), 'GarageArea'] = 2
all_data.loc[all_data['GarageArea']>992, 'GarageArea'] = 3
all_data['GarageArea'] = all_data['GarageArea'].astype(int)

all_data.drop('GarageArea_Band', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["GarageArea"], prefix="GarageArea")
all_data.head(3)
Out[94]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... GarageYrBlt_1 GarageYrBlt_2 GarageYrBlt_3 GarageFinish_Fin GarageFinish_None GarageFinish_RFn GarageFinish_Unf GarageArea_1 GarageArea_2 GarageArea_3
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 1 0 0 1 0 0 1 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 1 0 0 0 1 0 1 0 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 1 0 0 1 0 0 1 0

3 rows × 238 columns

GarageQual

  • Garage quality.
In [95]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="GarageQual", y="SalePrice", data=train, order=["Po", "Fa", "TA", "Gd", "Ex"], palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="GarageQual", y="SalePrice", data=train, size = 5, jitter = True, order=["Po", "Fa", "TA", "Gd", "Ex"], palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="GarageQual", y="SalePrice", data=train, order=["Po", "Fa", "TA", "Gd", "Ex"], palette = mycols);
  • We see a lot of homes having "TA" quality garages, with very few homes having high quality and low quality ones.
  • We are going to cluster the classes here, and then create dummy variables.
In [96]:
all_data['GarageQual'] = all_data['GarageQual'].map({"None":"None", "Po":"Low", "Fa":"Low", "TA":"TA", "Gd":"High", "Ex":"High"})
all_data['GarageQual'].unique()
Out[96]:
array(['TA', 'Low', 'High', 'None'], dtype=object)
In [97]:
all_data = pd.get_dummies(all_data, columns = ["GarageQual"], prefix="GarageQual")
all_data.head(3)
Out[97]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... GarageFinish_None GarageFinish_RFn GarageFinish_Unf GarageArea_1 GarageArea_2 GarageArea_3 GarageQual_High GarageQual_Low GarageQual_None GarageQual_TA
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 1 0 0 1 0 0 0 0 1
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 1 0 1 0 0 0 0 0 1
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 1 0 0 1 0 0 0 0 1

3 rows × 241 columns

GarageCond

  • Garage condition.
In [98]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="GarageCond", y="SalePrice", data=train, order=["Po", "Fa", "TA", "Gd", "Ex"], palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="GarageCond", y="SalePrice", data=train, size = 5, jitter = True, order=["Po", "Fa", "TA", "Gd", "Ex"], palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="GarageCond", y="SalePrice", data=train, order=["Po", "Fa", "TA", "Gd", "Ex"], palette = mycols);
  • We see a fairly similar pattern here with the previous feature. We see a slight positive correlation and then a dip, we believe due to the low number of houses that have "Ex" or "Gd" garage conditions.
  • Similarly to before, we are going to cluster and then dummy this feature.
In [99]:
all_data['GarageCond'] = all_data['GarageCond'].map({"None":"None", "Po":"Low", "Fa":"Low", "TA":"TA", "Gd":"High", "Ex":"High"})
all_data['GarageCond'].unique()
Out[99]:
array(['TA', 'Low', 'None', 'High'], dtype=object)
In [100]:
all_data = pd.get_dummies(all_data, columns = ["GarageCond"], prefix="GarageCond")
all_data.head(3)
Out[100]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... GarageArea_2 GarageArea_3 GarageQual_High GarageQual_Low GarageQual_None GarageQual_TA GarageCond_High GarageCond_Low GarageCond_None GarageCond_TA
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 1 0 0 0 0 1 0 0 0 1
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 1 0 0 0 1
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 1 0 0 0 0 1 0 0 0 1

3 rows × 244 columns

WoodDeckSF

  • Wood deck area in SF.
In [101]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['WoodDeckSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['WoodDeckSF'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="WoodDeckSF", data=train)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="WoodDeckSF", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="WoodDeckSF", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="WoodDeckSF", data=train, palette = mycols);
  • This feature has a high positive correlation with SalePrice.
  • We can also see that it varies widely with location, building type, style and size of the lot.
  • There is a significant number of data points with a value of 0, so we will create a flag to indicate no Wood Deck. Then, since this is a continuous numeric feature, and we believe it to be an important one, we will bin this and then create dummy features.
In [102]:
def WoodDeckFlag(col):
    if col['WoodDeckSF'] == 0:
        return 1
    else:
        return 0
    
all_data['NoWoodDeck_Flag'] = all_data.apply(WoodDeckFlag, axis=1)

all_data['WoodDeckSF_Band'] = pd.cut(all_data['WoodDeckSF'], 4)

all_data.loc[all_data['WoodDeckSF']<=356, 'WoodDeckSF'] = 1
all_data.loc[(all_data['WoodDeckSF']>356) & (all_data['WoodDeckSF']<=712), 'WoodDeckSF'] = 2
all_data.loc[(all_data['WoodDeckSF']>712) & (all_data['WoodDeckSF']<=1068), 'WoodDeckSF'] = 3
all_data.loc[all_data['WoodDeckSF']>1068, 'WoodDeckSF'] = 4
all_data['WoodDeckSF'] = all_data['WoodDeckSF'].astype(int)

all_data.drop('WoodDeckSF_Band', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["WoodDeckSF"], prefix="WoodDeckSF")
all_data.head(3)
Out[102]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... GarageQual_TA GarageCond_High GarageCond_Low GarageCond_None GarageCond_TA NoWoodDeck_Flag WoodDeckSF_1 WoodDeckSF_2 WoodDeckSF_3 WoodDeckSF_4
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 1 0 0 0 1 1 1 0 0 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 1 0 0 0 1 0 1 0 0 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 1 0 0 0 1 1 1 0 0 0

3 rows × 248 columns

OpenPorchSF, EnclosedPorch, 3SsnPorch & ScreenPorch

  • We will sum these features together to create a total porch in square feet feature.
In [103]:
all_data['TotalPorchSF'] = all_data['OpenPorchSF'] + all_data['OpenPorchSF'] + all_data['EnclosedPorch'] + all_data['3SsnPorch'] + all_data['ScreenPorch'] 
train['TotalPorchSF'] = train['OpenPorchSF'] + train['OpenPorchSF'] + train['EnclosedPorch'] + train['3SsnPorch'] + train['ScreenPorch']
In [104]:
rid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['TotalPorchSF'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['TotalPorchSF'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="TotalPorchSF", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="TotalPorchSF", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="TotalPorchSF", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="TotalPorchSF", data=train, palette = mycols);
  • We can see a high number of data points having a value of 0 here once again.
  • Apart from this, we see a high positive correlation with SalePrice showing that this may be an influential factor for analysis.
  • Finally, we see that this value ranges widely based on location, building type, style and lot.
  • We will create a flag to indicate no open porch, then we will bin the feature and create dummy variables.
In [105]:
def PorchFlag(col):
    if col['TotalPorchSF'] == 0:
        return 1
    else:
        return 0
    
all_data['NoPorch_Flag'] = all_data.apply(PorchFlag, axis=1)

all_data['TotalPorchSF_Band'] = pd.cut(all_data['TotalPorchSF'], 4)
all_data['TotalPorchSF_Band'].unique()
Out[105]:
[(-1.724, 431.0], (431.0, 862.0], (862.0, 1293.0], (1293.0, 1724.0]]
Categories (4, interval[float64]): [(-1.724, 431.0] < (431.0, 862.0] < (862.0, 1293.0] < (1293.0, 1724.0]]
In [106]:
all_data.loc[all_data['TotalPorchSF']<=431, 'TotalPorchSF'] = 1
all_data.loc[(all_data['TotalPorchSF']>431) & (all_data['TotalPorchSF']<=862), 'TotalPorchSF'] = 2
all_data.loc[(all_data['TotalPorchSF']>862) & (all_data['TotalPorchSF']<=1293), 'TotalPorchSF'] = 3
all_data.loc[all_data['TotalPorchSF']>1293, 'TotalPorchSF'] = 4
all_data['TotalPorchSF'] = all_data['TotalPorchSF'].astype(int)

all_data.drop('TotalPorchSF_Band', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["TotalPorchSF"], prefix="TotalPorchSF")
all_data.head(3)
Out[106]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... NoWoodDeck_Flag WoodDeckSF_1 WoodDeckSF_2 WoodDeckSF_3 WoodDeckSF_4 NoPorch_Flag TotalPorchSF_1 TotalPorchSF_2 TotalPorchSF_3 TotalPorchSF_4
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 1 1 0 0 0 0 1 0 0 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 1 0 0 0 1 1 0 0 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 1 1 0 0 0 0 1 0 0 0

3 rows × 253 columns

PoolArea

  • Pool area in square feet.
In [107]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['PoolArea'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['PoolArea'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="PoolArea", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="PoolArea", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="PoolArea", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="PoolArea", data=train, palette = mycols);
  • We see almost 0 correlation due to the high number of houses without a pool.
  • Hence, we will create a flag here.
In [108]:
def PoolFlag(col):
    if col['PoolArea'] == 0:
        return 0
    else:
        return 1
    
all_data['HasPool_Flag'] = all_data.apply(PoolFlag, axis=1)
all_data.drop('PoolArea', axis=1, inplace=True)

PoolQC

  • Pool quality.
In [109]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="PoolQC", y="SalePrice", data=train, order=["Fa", "Gd", "Ex"], palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="PoolQC", y="SalePrice", data=train, size = 5, jitter = True, order=["Fa", "Gd", "Ex"], palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="PoolQC", y="SalePrice", data=train, order=["Fa", "Gd", "Ex"], palette = mycols);
  • Due to not many houses having a pool, we see very low numbers of observations for each class.
  • Since this does not hold much information this feature, we will simply remove it.
In [110]:
all_data.drop('PoolQC', axis=1, inplace=True)

Fence

  • Fence quality.
In [111]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="Fence", y="SalePrice", data=train, order = ["MnWw", "GdWo", "MnPrv", "GdPrv"], palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="Fence", y="SalePrice", data=train, size = 5, jitter = True, order = ["MnWw", "GdWo", "MnPrv", "GdPrv"], palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="Fence", y="SalePrice", data=train, order = ["MnWw", "GdWo", "MnPrv", "GdPrv"], palette = mycols);
  • Here we see that the houses with the most privacy have the highest average SalePrice.
  • There seems to be a slight order within the classes, however some of the class descriptions are slightly ambiguous, therefore we will create dummy variables here from this categorical feature.
In [112]:
all_data = pd.get_dummies(all_data, columns = ["Fence"], prefix="Fence")
all_data.head(3)
Out[112]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... TotalPorchSF_1 TotalPorchSF_2 TotalPorchSF_3 TotalPorchSF_4 HasPool_Flag Fence_GdPrv Fence_GdWo Fence_MnPrv Fence_MnWw Fence_None
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 1 0 0 0 0 0 0 0 0 1
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 1 0 0 0 0 0 0 0 0 1
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 1 0 0 0 0 0 0 0 0 1

3 rows × 256 columns

4.2.5 - Location

MSZoning

  • Identifies the general zoning classification of the sale.
In [113]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="MSZoning", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="MSZoning", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="MSZoning", y="SalePrice", data=train, palette = mycols);

Since this a categorical feature without order, and each of the classes has a very different range and average for SalePrice, we will create dummy features here.

In [114]:
all_data = pd.get_dummies(all_data, columns = ["MSZoning"], prefix="MSZoning")
all_data.head(3)
Out[114]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... Fence_GdPrv Fence_GdWo Fence_MnPrv Fence_MnWw Fence_None MSZoning_C (all) MSZoning_FV MSZoning_RH MSZoning_RL MSZoning_RM
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 1 0 0 0 1 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 1 0 0 0 1 0
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 0 1 0 0 0 1 0

3 rows × 260 columns

Neighborhood

  • Physical locations within Ames city limits.
In [115]:
plt.subplots(figsize =(50, 10))

plt.subplot(1, 3, 1)
sns.boxplot(x="Neighborhood", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="Neighborhood", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="Neighborhood", y="SalePrice", data=train, palette = mycols);
  • Neighborhood clearly has an important contribution towards SalePrice, since we see such high values for certain areas and low values for others.
  • Since this is a categorical feature without order, we will create dummy features.
In [116]:
all_data = pd.get_dummies(all_data, columns = ["Neighborhood"], prefix="Neighborhood")
all_data.head(3)
Out[116]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Condition1 Condition2 Electrical ... Neighborhood_NoRidge Neighborhood_NridgHt Neighborhood_OldTown Neighborhood_SWISU Neighborhood_Sawyer Neighborhood_SawyerW Neighborhood_Somerst Neighborhood_StoneBr Neighborhood_Timber Neighborhood_Veenker
0 0 None 3 3 1 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 0 0 0
1 0 None 3 3 4 3 Y Feedr Norm SBrkr ... 0 0 0 0 0 0 0 0 0 1
2 0 None 3 3 2 3 Y Norm Norm SBrkr ... 0 0 0 0 0 0 0 0 0 0

3 rows × 284 columns

Condition1 & Condition2

  • Proximity to various conditions.
In [117]:
plt.subplots(figsize =(20, 10))

plt.subplot(2, 3, 1)
sns.boxplot(x="Condition1", y="SalePrice", data=train, palette = mycols)

plt.subplot(2, 3, 2)
sns.stripplot(x="Condition1", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(2, 3, 3)
sns.barplot(x="Condition1", y="SalePrice", data=train, palette = mycols);

plt.subplot(2, 3, 4)
sns.boxplot(x="Condition2", y="SalePrice", data=train, palette = mycols)

plt.subplot(2, 3, 5)
sns.stripplot(x="Condition2", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(2, 3, 6)
sns.barplot(x="Condition2", y="SalePrice", data=train, palette = mycols);
  • Since this feature is based around local features, it is understandable that having more desirable things, like a park nearby are a factor that would contribute towards a higher SalePrice.
  • For this feature we are going to cluster the classes based on the class description. Then, we will create dummy features.
  • We will then drop "Condition2" after creating a flag to indicate whether a different condition from the first is nearby.
In [118]:
all_data['Condition1'] = all_data['Condition1'].map({"Norm":"Norm", "Feedr":"Street", "PosN":"Pos", "Artery":"Street", "RRAe":"Train",
                                                    "RRNn":"Train", "RRAn":"Train", "PosA":"Pos", "RRNe":"Train"})
all_data['Condition2'] = all_data['Condition2'].map({"Norm":"Norm", "Feedr":"Street", "PosN":"Pos", "Artery":"Street", "RRAe":"Train",
                                                    "RRNn":"Train", "RRAn":"Train", "PosA":"Pos", "RRNe":"Train"})
In [119]:
def ConditionMatch(col):
    if col['Condition1'] == col['Condition2']:
        return 0
    else:
        return 1
    
all_data['Diff2ndCondition_Flag'] = all_data.apply(ConditionMatch, axis=1)
all_data.drop('Condition2', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["Condition1"], prefix="Condition1")
all_data.head(3)
Out[119]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual ... Neighborhood_SawyerW Neighborhood_Somerst Neighborhood_StoneBr Neighborhood_Timber Neighborhood_Veenker Diff2ndCondition_Flag Condition1_Norm Condition1_Pos Condition1_Street Condition1_Train
0 0 None 3 3 1 3 Y SBrkr 0 3 ... 0 0 0 0 0 0 1 0 0 0
1 0 None 3 3 4 3 Y SBrkr 0 2 ... 0 0 0 0 1 1 0 0 1 0
2 0 None 3 3 2 3 Y SBrkr 0 3 ... 0 0 0 0 0 0 1 0 0 0

3 rows × 287 columns

4.2.6 - Land

LotFrontage

  • Linear feet of street connected to property.
In [120]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['LotFrontage'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['LotFrontage'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="LotFrontage", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="LotFrontage", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="LotFrontage", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="LotFrontage", data=train, palette = mycols);
C:\Users\Derrick\AppData\Roaming\Python\Python36\site-packages\scipy\stats\stats.py:5240: RuntimeWarning: invalid value encountered in less
  x = np.where(x < 1.0, x, 1.0)  # if x > 1 then return 1.0
  • This feature seems to be fairly randomly distributed against SalePrice without any significant correlation.
  • LotFrontage doesn't seem to vary too much based on "Neighborhood", but the "BldgType" does seem to have a affect on the average LotFrontage.
  • Since this feature doesn't seem to show any significance to bin into groupings, we will leave this feature as it is until we scale the features.

LotArea

  • Lot size in square feet.
In [121]:
grid = plt.GridSpec(2, 3, wspace=0.1, hspace=0.15)
plt.subplots(figsize =(30, 15))

plt.subplot(grid[0, 0])
g = sns.regplot(x=train['LotArea'], y=train['SalePrice'], fit_reg=False, label = "corr: %2f"%(pearsonr(train['LotArea'], train['SalePrice'])[0]))
g = g.legend(loc="best")

plt.subplot(grid[0, 1:])
sns.boxplot(x="Neighborhood", y="LotArea", data=train, palette = mycols)

plt.subplot(grid[1, 0]);
sns.barplot(x="BldgType", y="LotArea", data=train, palette = mycols)

plt.subplot(grid[1, 1]);
sns.barplot(x="HouseStyle", y="LotArea", data=train, palette = mycols)

plt.subplot(grid[1, 2]);
sns.barplot(x="LotShape", y="LotArea", data=train, palette = mycols);
  • This feature shows a high correlation but it is very positively skewed.
  • Hence, we will create quantile bins and dummy features. Quantile bins are not based on approximately equal sized bins, instead creating bins with a similar frequency of data points within each bin.
In [122]:
all_data['LotArea_Band'] = pd.qcut(all_data['LotArea'], 8)
all_data['LotArea_Band'].unique()
Out[122]:
[(7474.0, 8520.0], (9450.0, 10355.25], (10355.25, 11554.5], (13613.0, 215245.0], (5684.75, 7474.0], (11554.5, 13613.0], (1299.999, 5684.75], (8520.0, 9450.0]]
Categories (8, interval[float64]): [(1299.999, 5684.75] < (5684.75, 7474.0] < (7474.0, 8520.0] < (8520.0, 9450.0] < (9450.0, 10355.25] < (10355.25, 11554.5] < (11554.5, 13613.0] < (13613.0, 215245.0]]
In [123]:
all_data.loc[all_data['LotArea']<=5684.75, 'LotArea'] = 1
all_data.loc[(all_data['LotArea']>5684.75) & (all_data['LotArea']<=7474), 'LotArea'] = 2
all_data.loc[(all_data['LotArea']>7474) & (all_data['LotArea']<=8520), 'LotArea'] = 3
all_data.loc[(all_data['LotArea']>8520) & (all_data['LotArea']<=9450), 'LotArea'] = 4
all_data.loc[(all_data['LotArea']>9450) & (all_data['LotArea']<=10355.25), 'LotArea'] = 5
all_data.loc[(all_data['LotArea']>10355.25) & (all_data['LotArea']<=11554.25), 'LotArea'] = 6
all_data.loc[(all_data['LotArea']>11554.25) & (all_data['LotArea']<=13613), 'LotArea'] = 7
all_data.loc[all_data['LotArea']>13613, 'LotArea'] = 8
all_data['LotArea'] = all_data['LotArea'].astype(int)

all_data.drop('LotArea_Band', axis=1, inplace=True)

all_data = pd.get_dummies(all_data, columns = ["LotArea"], prefix="LotArea")
all_data.head(3)
Out[123]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual ... Condition1_Street Condition1_Train LotArea_1 LotArea_2 LotArea_3 LotArea_4 LotArea_5 LotArea_6 LotArea_7 LotArea_8
0 0 None 3 3 1 3 Y SBrkr 0 3 ... 0 0 0 0 1 0 0 0 0 0
1 0 None 3 3 4 3 Y SBrkr 0 2 ... 1 0 0 0 0 0 1 0 0 0
2 0 None 3 3 2 3 Y SBrkr 0 3 ... 0 0 0 0 0 0 0 1 0 0

3 rows × 294 columns

LotShape

  • General shape of property.
In [124]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="LotShape", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="LotShape", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="LotShape", y="SalePrice", data=train, palette = mycols);
  • Clearly we see some extreme values for some categories and a varying SalePrice across classes.
  • "Reg" and "IR1" have the highest frequency of data points within them.
  • Since this is a categorical feature without order, we will create dummy features.
In [125]:
all_data = pd.get_dummies(all_data, columns = ["LotShape"], prefix="LotShape")
all_data.head(3)
Out[125]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual ... LotArea_3 LotArea_4 LotArea_5 LotArea_6 LotArea_7 LotArea_8 LotShape_IR1 LotShape_IR2 LotShape_IR3 LotShape_Reg
0 0 None 3 3 1 3 Y SBrkr 0 3 ... 1 0 0 0 0 0 0 0 0 1
1 0 None 3 3 4 3 Y SBrkr 0 2 ... 0 0 1 0 0 0 0 0 0 1
2 0 None 3 3 2 3 Y SBrkr 0 3 ... 0 0 0 1 0 0 1 0 0 0

3 rows × 297 columns

LandContour

  • Flatness of the property
In [126]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="LandContour", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="LandContour", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="LandContour", y="SalePrice", data=train, palette = mycols);
  • Most houses are indeed on a flat contour, however the houses with the highest SalePrice seem to come from properties on a hill interestingly.
  • Since this a categorical feature without order, we will create dummy features.
In [127]:
all_data = pd.get_dummies(all_data, columns = ["LandContour"], prefix="LandContour")
all_data.head(3)
Out[127]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual ... LotArea_7 LotArea_8 LotShape_IR1 LotShape_IR2 LotShape_IR3 LotShape_Reg LandContour_Bnk LandContour_HLS LandContour_Low LandContour_Lvl
0 0 None 3 3 1 3 Y SBrkr 0 3 ... 0 0 0 0 0 1 0 0 0 1
1 0 None 3 3 4 3 Y SBrkr 0 2 ... 0 0 0 0 0 1 0 0 0 1
2 0 None 3 3 2 3 Y SBrkr 0 3 ... 0 0 1 0 0 0 0 0 0 1

3 rows × 300 columns

LotConfig

  • Lot configuration.
In [128]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="LotConfig", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="LotConfig", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="LotConfig", y="SalePrice", data=train, palette = mycols);
  • Cul de sac's seem to boast the highest average prices, however most houses are positioned inside or on the corner of the lot.
  • To simplify this feature we will cluster "FR2" and "FR3", then create dummy features.
In [129]:
all_data['LotConfig'] = all_data['LotConfig'].map({"Inside":"Inside", "FR2":"FR", "Corner":"Corner", "CulDSac":"CulDSac", "FR3":"FR"})

all_data = pd.get_dummies(all_data, columns = ["LotConfig"], prefix="LotConfig")
all_data.head(3)
Out[129]:
3SsnPorch Alley BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual ... LotShape_IR3 LotShape_Reg LandContour_Bnk LandContour_HLS LandContour_Low LandContour_Lvl LotConfig_Corner LotConfig_CulDSac LotConfig_FR LotConfig_Inside
0 0 None 3 3 1 3 Y SBrkr 0 3 ... 0 1 0 0 0 1 0 0 0 1
1 0 None 3 3 4 3 Y SBrkr 0 2 ... 0 1 0 0 0 1 0 0 1 0
2 0 None 3 3 2 3 Y SBrkr 0 3 ... 0 0 0 0 0 1 0 0 0 1

3 rows × 303 columns

LandSlope

  • Slope of property.
In [130]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="LandSlope", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="LandSlope", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="LandSlope", y="SalePrice", data=train, palette = mycols);
  • We see that most houses have a gentle slope of land and overall, the severity of the slope doesn't appear to have much of an impact on SalePrice.
  • Hence, we are going to cluster "Mod" and "Sev" to create one class, and create a new flag to indicate a gentle slope or not.
In [131]:
all_data['LandSlope'] = all_data['LandSlope'].map({"Gtl":1, "Mod":2, "Sev":2})
def Slope(col):
    if col['LandSlope'] == 1:
        return 1
    else:
        return 0
    
all_data['GentleSlope_Flag'] = all_data.apply(Slope, axis=1)
all_data.drop('LandSlope', axis=1, inplace=True)

4.2.7 - Access

Street

  • Type of road access to the property.
In [132]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="Street", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="Street", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="Street", y="SalePrice", data=train, palette = mycols);
  • With such a lower number of observations being assigned to the class "Grvl" it is redundant within the model.
  • Hence, we will drop this feature.
In [133]:
all_data.drop('Street', axis=1, inplace=True)

Alley

  • Type of alley access to the property.
In [134]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="Alley", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="Alley", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="Alley", y="SalePrice", data=train, palette = mycols);
  • Here we see a fairly even split between to two classes in terms of frequency, but a much higher average SalePrice for Paved alleys as opposed to Gravel ones.
  • Hence, this seems as though it could be a good predictor. we will create dummy features from this.
In [135]:
all_data = pd.get_dummies(all_data, columns = ["Alley"], prefix="Alley")
all_data.head(3)
Out[135]:
3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual FireplaceQu ... LandContour_Low LandContour_Lvl LotConfig_Corner LotConfig_CulDSac LotConfig_FR LotConfig_Inside GentleSlope_Flag Alley_Grvl Alley_None Alley_Pave
0 0 3 3 1 3 Y SBrkr 0 3 0 ... 0 1 0 0 0 1 1 0 1 0
1 0 3 3 4 3 Y SBrkr 0 2 3 ... 0 1 0 0 1 0 1 0 1 0
2 0 3 3 2 3 Y SBrkr 0 3 3 ... 0 1 0 0 0 1 1 0 1 0

3 rows × 304 columns

PavedDrive

  • Paved driveway.
In [136]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="PavedDrive", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="PavedDrive", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="PavedDrive", y="SalePrice", data=train, palette = mycols);
  • Here we see the highest average price being demanded from houses with a paved driveway, and most houses in this area seem to have one.
  • Since this is a categorical feature without order, we will create dummy variables.
In [137]:
all_data = pd.get_dummies(all_data, columns = ["PavedDrive"], prefix="PavedDrive")
all_data.head(3)
Out[137]:
3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual FireplaceQu ... LotConfig_CulDSac LotConfig_FR LotConfig_Inside GentleSlope_Flag Alley_Grvl Alley_None Alley_Pave PavedDrive_N PavedDrive_P PavedDrive_Y
0 0 3 3 1 3 Y SBrkr 0 3 0 ... 0 0 1 1 0 1 0 0 0 1
1 0 3 3 4 3 Y SBrkr 0 2 3 ... 0 1 0 1 0 1 0 0 0 1
2 0 3 3 2 3 Y SBrkr 0 3 3 ... 0 0 1 1 0 1 0 0 0 1

3 rows × 306 columns

4.2.8 - Utilities

Heating

Type of heating.

In [138]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="Heating", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="Heating", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="Heating", y="SalePrice", data=train, palette = mycols);
  • We see the highest frequency and highest average SalePrice coming from "GasA" and a very low frequency from all other classes.
  • Hence, we will create a flag to indicate whether "GasA" is present or not.
In [139]:
all_data['GasA_Flag'] = all_data['Heating'].map({"GasA":1, "GasW":0, "Grav":0, "Wall":0, "OthW":0, "Floor":0})
all_data.drop('Heating', axis=1, inplace=True)
all_data.head(3)
Out[139]:
3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir Electrical EnclosedPorch ExterQual FireplaceQu ... LotConfig_FR LotConfig_Inside GentleSlope_Flag Alley_Grvl Alley_None Alley_Pave PavedDrive_N PavedDrive_P PavedDrive_Y GasA_Flag
0 0 3 3 1 3 Y SBrkr 0 3 0 ... 0 1 1 0 1 0 0 0 1 1
1 0 3 3 4 3 Y SBrkr 0 2 3 ... 1 0 1 0 1 0 0 0 1 1
2 0 3 3 2 3 Y SBrkr 0 3 3 ... 0 1 1 0 1 0 0 0 1 1

3 rows × 306 columns

HeatingQC

  • Heating quality and condition.
In [140]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="HeatingQC", y="SalePrice", data=train, order=["Po", "Fa", "TA", "Gd", "Ex"], palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="HeatingQC", y="SalePrice", data=train, size = 5, jitter = True, order=["Po", "Fa", "TA", "Gd", "Ex"], palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="HeatingQC", y="SalePrice", data=train, order=["Po", "Fa", "TA", "Gd", "Ex"], palette = mycols);
  • Here we see a positive correlation with SalePrice as the heating quality increases. With "Ex" bringing the highest average SalePrice.
  • We also see a high number of houses with this heating quality too, which means most houses had very good heating!
  • This is a categorical feature, however because it exhibits an order, we will replace the values manually with numbers.
In [141]:
all_data['HeatingQC'] = all_data['HeatingQC'].map({"Po":1, "Fa":2, "TA":3, "Gd":4, "Ex":5})
all_data['HeatingQC'].unique()
Out[141]:
array([5, 4, 3, 2, 1], dtype=int64)

CentralAir

  • Central air conditioning.
In [142]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="CentralAir", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="CentralAir", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="CentralAir", y="SalePrice", data=train, palette = mycols);
  • We see that houses with central air conditioning are able to demand a higher average SalePrice than ones without.
  • For this feature, we will simply replace the categories with numbers 0 and 1.
In [143]:
all_data['CentralAir'] = all_data['CentralAir'].map({"Y":1, "N":0})
all_data['CentralAir'].unique()
Out[143]:
array([1, 0], dtype=int64)

Electrical

  • Electrical system.
In [144]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="Electrical", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="Electrical", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="Electrical", y="SalePrice", data=train, palette = mycols);
  • We see the highest average SalePrice coming from houses with "SBrkr" electrics, and these are also the most frequent electrical systems installed in the houses from this area.
  • We have 2 categories in particular that have very low frequencies, "FuseP" and "Mix".
  • We are going to cluster all the classes related to fuses, and the "Mix" class will probably be removed during feature reduction.
In [145]:
all_data['Electrical'] = all_data['Electrical'].map({"SBrkr":"SBrkr", "FuseF":"Fuse", "FuseA":"Fuse", "FuseP":"Fuse", "Mix":"Mix"})

all_data = pd.get_dummies(all_data, columns = ["Electrical"], prefix="Electrical")
all_data.head(3)
Out[145]:
3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir EnclosedPorch ExterQual FireplaceQu Fireplaces ... Alley_Grvl Alley_None Alley_Pave PavedDrive_N PavedDrive_P PavedDrive_Y GasA_Flag Electrical_Fuse Electrical_Mix Electrical_SBrkr
0 0 3 3 1 3 1 0 3 0 0 ... 0 1 0 0 0 1 1 0 0 1
1 0 3 3 4 3 1 0 2 3 1 ... 0 1 0 0 0 1 1 0 0 1
2 0 3 3 2 3 1 0 3 3 1 ... 0 1 0 0 0 1 1 0 0 1

3 rows × 308 columns

4.2.9 - Miscellaneous

MiscFeature

  • Miscellaneous feature not covered in other categories.
In [146]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="MiscFeature", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="MiscFeature", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="MiscFeature", y="SalePrice", data=train, palette = mycols);
  • We can see here that only a low number of houses in this area with any miscalleanous features. Hence, we do not believe that this feature holds much.
  • Therefore we will drop this feature along with MiscVal.
In [147]:
columns=['MiscFeature', 'MiscVal']
all_data.drop(columns, axis=1, inplace=True)

MoSold

  • Month sold (MM).
In [148]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="MoSold", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="MoSold", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="MoSold", y="SalePrice", data=train, palette = mycols);
  • Although this feature is a numeric feature, it should really be a category.
  • We can see that there is no real indicator as to any months that consistetly sold houses of a higher price, however there does seem to be a fairly even distribution of values between classes.
  • We will create dummy variables from each category.
In [149]:
all_data = pd.get_dummies(all_data, columns = ["MoSold"], prefix="MoSold")
all_data.head(3)
Out[149]:
3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir EnclosedPorch ExterQual FireplaceQu Fireplaces ... MoSold_3 MoSold_4 MoSold_5 MoSold_6 MoSold_7 MoSold_8 MoSold_9 MoSold_10 MoSold_11 MoSold_12
0 0 3 3 1 3 1 0 3 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 3 3 4 3 1 0 2 3 1 ... 0 0 1 0 0 0 0 0 0 0
2 0 3 3 2 3 1 0 3 3 1 ... 0 0 0 0 0 0 1 0 0 0

3 rows × 317 columns

YrSold

  • Year sold (YYYY).
In [150]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="YrSold", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="YrSold", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="YrSold", y="SalePrice", data=train, palette = mycols);
  • Here we see just a 5 year time period of which the houses in this dataset were sold.
  • There is a n even distribution of values between each class, and each year has a very similar average SalePrice.
  • Even though this is numeric, it should be categorical. Therefore we will create dummy variables.
In [151]:
all_data = pd.get_dummies(all_data, columns = ["YrSold"], prefix="YrSold")
all_data.head(3)
Out[151]:
3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir EnclosedPorch ExterQual FireplaceQu Fireplaces ... MoSold_8 MoSold_9 MoSold_10 MoSold_11 MoSold_12 YrSold_2006 YrSold_2007 YrSold_2008 YrSold_2009 YrSold_2010
0 0 3 3 1 3 1 0 3 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 3 3 4 3 1 0 2 3 1 ... 0 0 0 0 0 0 1 0 0 0
2 0 3 3 2 3 1 0 3 3 1 ... 0 1 0 0 0 0 0 1 0 0

3 rows × 321 columns

SaleType

  • Type of sale.
In [152]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="SaleType", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="SaleType", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="SaleType", y="SalePrice", data=train, palette = mycols);
  • Most houses were sold under the "WD" category, being a conventional sale, however the highest SalePrice was seen from houses that were sold as houses that were brand new and just sold.
  • For this feature, we will cluster some categories together and then create dummy features.
In [153]:
all_data['SaleType'] = all_data['SaleType'].map({"WD":"WD", "New":"New", "COD":"COD", "CWD":"CWD", "ConLD":"Oth", "ConLI":"Oth", 
                                                 "ConLw":"Oth", "Con":"Oth", "Oth":"Oth"})

all_data = pd.get_dummies(all_data, columns = ["SaleType"], prefix="SaleType")
all_data.head(3)
Out[153]:
3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir EnclosedPorch ExterQual FireplaceQu Fireplaces ... YrSold_2006 YrSold_2007 YrSold_2008 YrSold_2009 YrSold_2010 SaleType_COD SaleType_CWD SaleType_New SaleType_Oth SaleType_WD
0 0 3 3 1 3 1 0 3 0 0 ... 0 0 1 0 0 0 0 0 0 1
1 0 3 3 4 3 1 0 2 3 1 ... 0 1 0 0 0 0 0 0 0 1
2 0 3 3 2 3 1 0 3 3 1 ... 0 0 1 0 0 0 0 0 0 1

3 rows × 325 columns

SaleCondition

  • Condition of sale.
In [154]:
plt.subplots(figsize =(20, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x="SaleCondition", y="SalePrice", data=train, palette = mycols)

plt.subplot(1, 3, 2)
sns.stripplot(x="SaleCondition", y="SalePrice", data=train, size = 5, jitter = True, palette = mycols);

plt.subplot(1, 3, 3)
sns.barplot(x="SaleCondition", y="SalePrice", data=train, palette = mycols);
  • Here we see the largest average SalePrice being associated with partial sales, and the most frequent sale seems to be the normal sales.
  • Since this is a categorical feature without order, we will create dummy features.
In [155]:
all_data = pd.get_dummies(all_data, columns = ["SaleCondition"], prefix="SaleCondition")
all_data.head(3)
Out[155]:
3SsnPorch BedroomAbvGr BsmtCond BsmtExposure BsmtQual CentralAir EnclosedPorch ExterQual FireplaceQu Fireplaces ... SaleType_CWD SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 0 3 3 1 3 1 0 3 0 0 ... 0 0 0 1 0 0 0 0 1 0
1 0 3 3 4 3 1 0 2 3 1 ... 0 0 0 1 0 0 0 0 1 0
2 0 3 3 2 3 1 0 3 3 1 ... 0 0 0 1 0 0 0 0 1 0

3 rows × 330 columns

4.3 - Target Variable

  • We are going to check the distribution of the target variable, and numeric variables, when building a regression model. Machine Learning algorithms work well with features that are normally distributed, a distribution that is symmetric and has a characteristic bell shape.
In [156]:
plt.subplots(figsize=(15, 10))
g = sns.distplot(train['SalePrice'], fit=norm, label = "Skewness : %.2f"%(train['SalePrice'].skew()));
g = g.legend(loc="best")
  • The distribution of the target variable is positively skewed, meaning that the mode is always less than the mean and median.

  • In order to transform this variable into a distribution that looks closer to the black line shown above, we can use the numpy function log1p which applies log(1+x) to all elements within the feature.

In [157]:
train["SalePrice"] = np.log1p(train["SalePrice"])
y_train = train["SalePrice"]

#Check the new distribution 
plt.subplots(figsize=(15, 10))
g = sns.distplot(train['SalePrice'], fit=norm, label = "Skewness : %.2f"%(train['SalePrice'].skew()));
g = g.legend(loc="best")
  • We can see from the skewness and the plot that it follows much more closely to the normal distribution now. This will help the algorithms work most reliably because we are now predicting a distribution that is well-known, i.e. the normal distribution.Since we have transformed our target variable to log, we can convert the value back after the prediction by taking the expotential.

4.4 - Treating skewed features

  • Previously we were treating the target variable, SalePrice. We will do similarly for the features.
In [158]:
# First lets single out the numeric features
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check how skewed they are
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)

plt.subplots(figsize =(90, 30))
skewed_feats.plot(kind='bar');
  • Clearly, we have a variety of positive and negative skewing features. Now we will transform the features with skew > 0.5 to follow more closely the normal distribution.

  • Note: We are using the Box-Cox transformation to transform non-normal variables into a normal shape. Normality is an important assumption for many statistical techniques; if the data isn't normal, applying a Box-Cox means that you are able to run a broader number of tests.

In [159]:
skewness = skewed_feats[abs(skewed_feats) > 0.5]

skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    all_data[feat] = boxcox1p(all_data[feat], lam)

print(skewness.shape[0],  "skewed numerical features have been Box-Cox transformed")
302 skewed numerical features have been Box-Cox transformed

5. Model

5.1 - Preparation of data

  • We will reduce the number of features by using XGBoost's inbuilt feature importance functionality.
In [160]:
# First, re-create the training and test datasets
train = all_data[:ntrain]
test = all_data[ntrain:]

print(train.shape)
print(test.shape)
(1456, 330)
(1459, 330)
In [161]:
import xgboost as xgb

model = xgb.XGBRegressor()
model.fit(train, y_train)

# Sort feature importances
indices = np.argsort(model.feature_importances_)[::-1]
indices = indices[:75]

# Visualise these with a barplot
plt.subplots(figsize=(20, 15))
g = sns.barplot(y=train.columns[indices], x = model.feature_importances_[indices], orient='h', palette = mycols)
g.set_xlabel("Relative importance",fontsize=12)
g.set_ylabel("Features",fontsize=12)
g.tick_params(labelsize=9)
g.set_title("XGB feature importance");
In [162]:
xgb_train = train.copy()
xgb_test = test.copy()

import xgboost as xgb
model = xgb.XGBRegressor()
model.fit(xgb_train, y_train)

# Allow the feature importances attribute to select the most important features
xgb_feat_red = SelectFromModel(model, prefit = True)

# Reduce estimation, validation and test datasets
xgb_train = xgb_feat_red.transform(xgb_train)
xgb_test = xgb_feat_red.transform(xgb_test)


print("Results of 'feature_importances_':")
print('X_train: ', xgb_train.shape, '\nX_test: ', xgb_test.shape)
Results of 'feature_importances_':
X_train:  (1456, 71) 
X_test:  (1459, 71)
In [163]:
# Next we want to sample our training data to test for performance of robustness ans accuracy, before applying to the test data
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(xgb_train, y_train, test_size=0.3, random_state=42)

# X_train = predictor features for estimation dataset
# X_test = predictor features for validation dataset
# Y_train = target variable for the estimation dataset
# Y_test = target variable for the estimation dataset

print('X_train: ', X_train.shape, '\nX_test: ', X_test.shape, '\nY_train: ', Y_train.shape, '\nY_test: ', Y_test.shape)
X_train:  (1019, 71) 
X_test:  (437, 71) 
Y_train:  (1019,) 
Y_test:  (437,)

5.2 - Training

For this analysis we are using 8 different algorithms:

  • Kernel Ridge Regression
  • Elastic Net
  • Lasso
  • Gradient Boosting
  • Bayesian Ridge
  • Lasso Lars IC
  • Random Forest Regressor
  • XGBoost

The method of measuring accuracy was chosen to be Root Mean Squared Error which was mentioned in the competition details. We wull use the inbulit function in scikit learn.

In [164]:
import xgboost as xgb
#Machine Learning Algorithm (MLA) Selection and Initialization
models = [KernelRidge(), ElasticNet(), Lasso(), GradientBoostingRegressor(), BayesianRidge(), LassoLarsIC(), RandomForestRegressor(), xgb.XGBRegressor()]

# First we will use ShuffleSplit as a way of randomising the cross validation samples.
shuff = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

#create table to compare algos and parameters
columns = ['Name', 'Parameters', 'Train Accuracy Mean', 'Test Accuracy']
before_model_compare = pd.DataFrame(columns = columns)

#index through models and save performance to table
row_index = 0
for alg in models:

    #set name and parameters
    model_name = alg.__class__.__name__
    before_model_compare.loc[row_index, 'Name'] = model_name
    before_model_compare.loc[row_index, 'Parameters'] = str(alg.get_params())
    
    alg.fit(X_train, Y_train)
    
    #score model with cross validation
    training_results = np.sqrt((-cross_val_score(alg, X_train, Y_train, cv = shuff, scoring= 'neg_mean_squared_error')).mean())
    test_results = np.sqrt(((Y_test-alg.predict(X_test))**2).mean())
    
    before_model_compare.loc[row_index, 'Train Accuracy Mean'] = (training_results)*100
    before_model_compare.loc[row_index, 'Test Accuracy'] = (test_results)*100
    
    row_index+=1
    print(row_index, alg.__class__.__name__, 'trained...')

decimals = 3
before_model_compare['Train Accuracy Mean'] = before_model_compare['Train Accuracy Mean'].apply(lambda x: round(x, decimals))
before_model_compare['Test Accuracy'] = before_model_compare['Test Accuracy'].apply(lambda x: round(x, decimals))
before_model_compare
1 KernelRidge trained...
2 ElasticNet trained...
3 Lasso trained...
4 GradientBoostingRegressor trained...
5 BayesianRidge trained...
6 LassoLarsIC trained...
7 RandomForestRegressor trained...
8 XGBRegressor trained...
Out[164]:
Name Parameters Train Accuracy Mean Test Accuracy
0 KernelRidge {'alpha': 1, 'coef0': 1, 'degree': 3, 'gamma':... 30.763 32.917
1 ElasticNet {'alpha': 1.0, 'copy_X': True, 'fit_intercept'... 22.081 22.351
2 Lasso {'alpha': 1.0, 'copy_X': True, 'fit_intercept'... 27.211 27.333
3 GradientBoostingRegressor {'alpha': 0.9, 'criterion': 'friedman_mse', 'i... 12.307 12.362
4 BayesianRidge {'alpha_1': 1e-06, 'alpha_2': 1e-06, 'compute_... 11.229 11.759
5 LassoLarsIC {'copy_X': True, 'criterion': 'aic', 'eps': 2.... 12.552 12.511
6 RandomForestRegressor {'bootstrap': True, 'criterion': 'mse', 'max_d... 14.706 13.897
7 XGBRegressor {'base_score': 0.5, 'booster': 'gbtree', 'cols... 12.542 12.421

5.3 - Optimisation

  • We will use GridSearchCV to find the best combinations of parameters to produce the highest scoring models.
In [165]:
models = [KernelRidge(), ElasticNet(), Lasso(), GradientBoostingRegressor(), BayesianRidge(), LassoLarsIC(), RandomForestRegressor(), xgb.XGBRegressor()]

KR_param_grid = {'alpha': [0.1], 'coef0': [100], 'degree': [1], 'gamma': [None], 'kernel': ['polynomial']}
EN_param_grid = {'alpha': [0.001], 'copy_X': [True], 'l1_ratio': [0.6], 'fit_intercept': [True], 'normalize': [False], 
                         'precompute': [False], 'max_iter': [300], 'tol': [0.001], 'selection': ['random'], 'random_state': [None]}
LASS_param_grid = {'alpha': [0.0005], 'copy_X': [True], 'fit_intercept': [True], 'normalize': [False], 'precompute': [False], 
                    'max_iter': [300], 'tol': [0.01], 'selection': ['random'], 'random_state': [None]}
GB_param_grid = {'loss': ['huber'], 'learning_rate': [0.1], 'n_estimators': [300], 'max_depth': [3], 
                                        'min_samples_split': [0.0025], 'min_samples_leaf': [5]}
BR_param_grid = {'n_iter': [200], 'tol': [0.00001], 'alpha_1': [0.00000001], 'alpha_2': [0.000005], 'lambda_1': [0.000005], 
                 'lambda_2': [0.00000001], 'copy_X': [True]}
LL_param_grid = {'criterion': ['aic'], 'normalize': [True], 'max_iter': [100], 'copy_X': [True], 'precompute': ['auto'], 'eps': [0.000001]}
RFR_param_grid = {'n_estimators': [50], 'max_features': ['auto'], 'max_depth': [None], 'min_samples_split': [5], 'min_samples_leaf': [2]}
XGB_param_grid = {'max_depth': [3], 'learning_rate': [0.1], 'n_estimators': [300], 'booster': ['gbtree'], 'gamma': [0], 'reg_alpha': [0.1],
                  'reg_lambda': [0.7], 'max_delta_step': [0], 'min_child_weight': [1], 'colsample_bytree': [0.5], 'colsample_bylevel': [0.2],
                  'scale_pos_weight': [1]}
params_grid = [KR_param_grid, EN_param_grid, LASS_param_grid, GB_param_grid, BR_param_grid, LL_param_grid, RFR_param_grid, XGB_param_grid]

after_model_compare = pd.DataFrame(columns = columns)

row_index = 0
for alg in models:
    
    gs_alg = GridSearchCV(alg, param_grid = params_grid[0], cv = shuff, scoring = 'neg_mean_squared_error', n_jobs=-1)
    params_grid.pop(0)

    #set name and parameters
    model_name = alg.__class__.__name__
    after_model_compare.loc[row_index, 'Name'] = model_name
    
    gs_alg.fit(X_train, Y_train)
    gs_best = gs_alg.best_estimator_
    after_model_compare.loc[row_index, 'Parameters'] = str(gs_alg.best_params_)
    
    #score model with cross validation
    after_training_results = np.sqrt(-gs_alg.best_score_)
    after_test_results = np.sqrt(((Y_test-gs_alg.predict(X_test))**2).mean())
    
    after_model_compare.loc[row_index, 'Train Accuracy Mean'] = (after_training_results)*100
    after_model_compare.loc[row_index, 'Test Accuracy'] = (after_test_results)*100
    
    row_index+=1
    print(row_index, alg.__class__.__name__, 'trained...')

decimals = 3
after_model_compare['Train Accuracy Mean'] = after_model_compare['Train Accuracy Mean'].apply(lambda x: round(x, decimals))
after_model_compare['Test Accuracy'] = after_model_compare['Test Accuracy'].apply(lambda x: round(x, decimals))
after_model_compare
1 KernelRidge trained...
2 ElasticNet trained...
3 Lasso trained...
4 GradientBoostingRegressor trained...
5 BayesianRidge trained...
6 LassoLarsIC trained...
7 RandomForestRegressor trained...
8 XGBRegressor trained...
Out[165]:
Name Parameters Train Accuracy Mean Test Accuracy
0 KernelRidge {'alpha': 0.1, 'coef0': 100, 'degree': 1, 'gam... 11.212 11.911
1 ElasticNet {'alpha': 0.001, 'copy_X': True, 'fit_intercep... 11.218 11.911
2 Lasso {'alpha': 0.0005, 'copy_X': True, 'fit_interce... 11.196 11.771
3 GradientBoostingRegressor {'learning_rate': 0.1, 'loss': 'huber', 'max_d... 12.092 12.153
4 BayesianRidge {'alpha_1': 1e-08, 'alpha_2': 5e-06, 'copy_X':... 11.229 11.759
5 LassoLarsIC {'copy_X': True, 'criterion': 'aic', 'eps': 1e... 12.552 12.511
6 RandomForestRegressor {'max_depth': None, 'max_features': 'auto', 'm... 13.764 13.867
7 XGBRegressor {'booster': 'gbtree', 'colsample_bylevel': 0.2... 12.073 11.789

5.4 - Stacking

We use the best average performing model as the meta-model. All other models will be used as base estimators.

In [166]:
models = [KernelRidge(), ElasticNet(), Lasso(), GradientBoostingRegressor(), BayesianRidge(), LassoLarsIC(), RandomForestRegressor(), xgb.XGBRegressor()]
names = ['KernelRidge', 'ElasticNet', 'Lasso', 'Gradient Boosting', 'Bayesian Ridge', 'Lasso Lars IC', 'Random Forest', 'XGBoost']
params_grid = [KR_param_grid, EN_param_grid, LASS_param_grid, GB_param_grid, BR_param_grid, LL_param_grid, RFR_param_grid, XGB_param_grid]
stacked_validation_train = pd.DataFrame()
stacked_test_train = pd.DataFrame()

row_index=0

for alg in models:
    
    gs_alg = GridSearchCV(alg, param_grid = params_grid[0], cv = shuff, scoring = 'neg_mean_squared_error', n_jobs=-1)
    params_grid.pop(0)
    
    gs_alg.fit(X_train, Y_train)
    gs_best = gs_alg.best_estimator_
    stacked_validation_train.insert(loc = row_index, column = names[0], value = gs_best.predict(X_test))
    print(row_index+1, alg.__class__.__name__, 'predictions added to stacking validation dataset...')
    
    stacked_test_train.insert(loc = row_index, column = names[0], value = gs_best.predict(xgb_test))
    print(row_index+1, alg.__class__.__name__, 'predictions added to stacking test dataset...')
    print("-"*50)
    names.pop(0)
    
    row_index+=1
    
print('Done')
1 KernelRidge predictions added to stacking validation dataset...
1 KernelRidge predictions added to stacking test dataset...
--------------------------------------------------
2 ElasticNet predictions added to stacking validation dataset...
2 ElasticNet predictions added to stacking test dataset...
--------------------------------------------------
3 Lasso predictions added to stacking validation dataset...
3 Lasso predictions added to stacking test dataset...
--------------------------------------------------
4 GradientBoostingRegressor predictions added to stacking validation dataset...
4 GradientBoostingRegressor predictions added to stacking test dataset...
--------------------------------------------------
5 BayesianRidge predictions added to stacking validation dataset...
5 BayesianRidge predictions added to stacking test dataset...
--------------------------------------------------
6 LassoLarsIC predictions added to stacking validation dataset...
6 LassoLarsIC predictions added to stacking test dataset...
--------------------------------------------------
7 RandomForestRegressor predictions added to stacking validation dataset...
7 RandomForestRegressor predictions added to stacking test dataset...
--------------------------------------------------
8 XGBRegressor predictions added to stacking validation dataset...
8 XGBRegressor predictions added to stacking test dataset...
--------------------------------------------------
Done
In [167]:
# First drop the Lasso results from the table, as we will be using Lasso as the meta-model
drop = ['Lasso']
stacked_validation_train.drop(drop, axis=1, inplace=True)
stacked_test_train.drop(drop, axis=1, inplace=True)

# Now fit the meta model and generate predictions
meta_model = make_pipeline(RobustScaler(), Lasso(alpha=0.00001, copy_X = True, fit_intercept = True,
                                              normalize = False, precompute = False, max_iter = 10000,
                                              tol = 0.0001, selection = 'random', random_state = None))
meta_model.fit(stacked_validation_train, Y_test)

meta_model_pred = np.expm1(meta_model.predict(stacked_test_train))
print("Meta-model trained and applied!...")
Meta-model trained and applied!...

5.5 - Ensemble

  • Using the meta-model that we created, we will combine it with the results of the individual optimised models to create an ensemble.
In [168]:
models = [KernelRidge(), ElasticNet(), Lasso(), GradientBoostingRegressor(), BayesianRidge(), LassoLarsIC(), RandomForestRegressor(), xgb.XGBRegressor()]
names = ['KernelRidge', 'ElasticNet', 'Lasso', 'Gradient Boosting', 'Bayesian Ridge', 'Lasso Lars IC', 'Random Forest', 'XGBoost']
params_grid = [KR_param_grid, EN_param_grid, LASS_param_grid, GB_param_grid, BR_param_grid, LL_param_grid, RFR_param_grid, XGB_param_grid]
final_predictions = pd.DataFrame()

row_index=0

for alg in models:
    
    gs_alg = GridSearchCV(alg, param_grid = params_grid[0], cv = shuff, scoring = 'neg_mean_squared_error', n_jobs=-1)
    params_grid.pop(0)
    
    gs_alg.fit(stacked_validation_train, Y_test)
    gs_best = gs_alg.best_estimator_
    final_predictions.insert(loc = row_index, column = names[0], value = np.expm1(gs_best.predict(stacked_test_train)))
    print(row_index+1, alg.__class__.__name__, 'final results predicted added to table...')
    names.pop(0)
    
    row_index+=1

print("-"*50)
print("Done")
    
final_predictions.head()
1 KernelRidge final results predicted added to table...
2 ElasticNet final results predicted added to table...
3 Lasso final results predicted added to table...
4 GradientBoostingRegressor final results predicted added to table...
5 BayesianRidge final results predicted added to table...
6 LassoLarsIC final results predicted added to table...
7 RandomForestRegressor final results predicted added to table...
8 XGBRegressor final results predicted added to table...
--------------------------------------------------
Done
Out[168]:
KernelRidge ElasticNet Lasso Gradient Boosting Bayesian Ridge Lasso Lars IC Random Forest XGBoost
0 121659.839306 122175.999052 121154.020393 122346.579182 122643.509585 121562.527731 120643.690548 117312.320312
1 168328.787247 168721.081964 168947.496958 166983.580024 169165.539758 168504.875143 163386.369421 164382.484375
2 184593.504462 184704.300878 184307.989725 182291.590986 185741.386927 184839.283589 182439.712747 184002.875000
3 196516.385900 196645.432502 196717.849384 182724.854127 197795.948662 196820.221613 179683.560117 185633.421875
4 183829.139764 183006.895843 184397.427130 182774.486786 182311.056400 182514.676524 181120.009075 183656.640625
  • Some models will be much better at catching certain signals in the data, whereas others may perform better in other situations.
  • By creating an ensemble of all of these results, it helps to create a more generalised model that is resistant to noise.
In [169]:
ensemble = meta_model_pred*(1/10) + final_predictions['XGBoost']*(1.5/10) + final_predictions['Gradient Boosting']*(2/10) + final_predictions['Bayesian Ridge']*(1/10) + final_predictions['Lasso']*(1/10) + final_predictions['KernelRidge']*(1/10) + final_predictions['Lasso Lars IC']*(1/10) + final_predictions['Random Forest']*(1.5/10)

submission = pd.DataFrame()
submission['Id'] = test_ID
submission['SalePrice'] = ensemble
submission.to_csv('final_submission.csv',index=False)
print("Submission file, created!")
Submission file, created!